jmlr jmlr2006 jmlr2006-1 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
Reference: text
sentIndex sentText sentNum sentScore
1 Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. [sent-14, score-0.187]
2 These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. [sent-15, score-0.076]
3 Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant 1. [sent-17, score-0.359]
4 Introduction Many kernel learning algorithms (KLA) have been proposed for solving different kinds of problems. [sent-18, score-0.076]
5 , 2002) is another competitive algorithm for classification, the one-class SVM (Schölkopf and Smola, 2002) is a useful tool for novelty detection, while the kernel Fisher discriminant (KFD) (Mika et al. [sent-20, score-0.076]
6 , 2003) and the kernel PCA (KPCA) (Schölkopf and Smola, 2002) are powerful algorithms for feature extraction. [sent-21, score-0.076]
7 Many kernel learning algorithms result in a kernel machine (KM) (such as a kernel classifier), whose output can be calculated as $\tau(x) = \sum_{i=1}^{N_{XV}} \hat{\alpha}_i K(\hat{x}_i, x) + b$. (1) [sent-22, score-0.248]
8 Usually K is a positive definite kernel (Schölkopf and Smola, 2002), which implicitly introduces a feature space F. [sent-24, score-0.076]
9 Hence (1) can also be written as a linear function $\tau(x) = \langle w, \phi(x)\rangle + b$, (2) where $w = \sum_{i=1}^{N_{XV}} \hat{\alpha}_i \phi(\hat{x}_i)$ (3) is the weight vector of the KM and it equals the linear expansion of XVs in the feature space F. [sent-26, score-0.084]
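To make (1)–(3) concrete, here is a minimal sketch of evaluating such a kernel machine; the Gaussian kernel, the expansion vectors `XV`, the coefficients `alpha` and the bias `b` below are illustrative placeholders rather than quantities taken from the paper.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=0.5):
    # One common positive definite kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def km_output(x, XV, alpha, b, kernel=gaussian_kernel):
    # tau(x) = sum_i alpha_hat_i * K(x_hat_i, x) + b, as in (1)
    return sum(a * kernel(xv, x) for a, xv in zip(alpha, XV)) + b

# toy usage with made-up numbers
XV = np.array([[0.0, 1.0], [1.0, -1.0]])   # N_XV = 2 expansion vectors
alpha = np.array([0.7, -0.3])              # expansion coefficients
print(km_output([0.5, 0.5], XV, alpha, b=0.1))
```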
10 First, each of them results in a KM whose key component is a weight vector w, which can be expressed as the linear expansion of XVs in the feature space F. [sent-32, score-0.084]
11 Thus several sparse learning algorithms have been proposed to build KMs with small NXV . [sent-38, score-0.129]
12 The $\hat{x}_i$, $1 \le i \le N_{XV}$, have different names in different kernel learning algorithms. [sent-48, score-0.1]
13 In this paper we uniformly call them expansion vectors for the sake of simplicity. [sent-50, score-0.087]
14 In (Lee and Mangasarian, 2001), the reduced support vector machine (RSVM) algorithm is proposed, which randomly selects Nz vectors from the training set as XVs, and then computes the expansion coefficients. [sent-52, score-0.119]
15 This algorithm can be applied to build sparse kernel classifiers. [sent-53, score-0.205]
16 The relevance vector machine (RVM) (Tipping, 2001) is another algorithm which leads to sparse KMs. [sent-55, score-0.106]
17 The basic idea of the RVM is to assume a prior of the expansion coefficients which favors sparse solutions. [sent-56, score-0.156]
18 In this paper, based on the common points of KLAs mentioned before, we present a direct method to build sparse kernel learning algorithms (SKLA). [sent-57, score-0.229]
19 We will also show that applying this method to the SVM will result in a specific algorithm for building sparse large margin classifiers (SLMC). [sent-60, score-0.187]
20 In section 2, we describe a direct method for building SKLAs. [sent-62, score-0.092]
21 The SLMC algorithm is presented in section 3, where we will also point out that it actually finds a discriminating subspace of the feature space F. [sent-64, score-0.076]
22 A Direct Method of Building SKLAs In this section, we propose a direct method for building SKLAs. [sent-68, score-0.092]
23 $N_\xi$ is the number of auxiliary variables, $N_g$ is the number of inequality constraints specified by $g_i(w, b, \xi)$, while $N_h$ is the number of equality constraints specified by $h_j(w, b, \xi)$. [sent-80, score-0.135]
24 To achieve this, we propose to solve the following problem: $\min_{w,b,\xi,\beta,Z} f(w, b, \xi)$, (8) subject to $g_i(w, b, \xi) \le 0$, $1 \le i \le N_g$, (9) $h_j(w, b, \xi) = 0$, $1 \le j \le N_h$, (10) and $w = \sum_{i=1}^{N_z} \phi(z_i)\beta_i$, (11) where (8)–(10) are exactly the same as (5)–(7), while $Z = [z_1, \ldots, z_{N_z}]$. [sent-82, score-0.152]
25 It can be seen that the above problem is the problem (5)–(7) with one added constraint (11) saying that the weight vector of the resulting KM equals the expansion of the φ(zi ), 1 ≤ i ≤ Nz . [sent-89, score-0.084]
26 Due to the constraint (11), solving the problem (8)–(11) will naturally lead to a sparse KM. [sent-91, score-0.088]
27 However, our gradient based minimization will be performed only on the expansion vectors Z but not on all the variables. [sent-94, score-0.121]
28 To this end, we define the following marginal function W(Z), which is obtained by keeping the expansion vectors Z in problem (8)–(11) fixed, i.e. [sent-95, score-0.087]
29 $W(Z) := \min_{w,b,\xi,\beta} f(w, b, \xi)$, (12) subject to $g_i(w, b, \xi) \le 0$, $1 \le i \le N_g$, (13) $h_j(w, b, \xi) = 0$, $1 \le j \le N_h$, (14) and $w = \sum_{i=1}^{N_z} \phi(z_i)\beta_i$. (15) [sent-97, score-0.152]
30 According to the constraint (11), the weight vector w can be completely determined by β and Z, so the functions f , gi and hj can be regarded as functions of b, ξ, β and Z. [sent-109, score-0.151]
31 Without causing confusion, we write them as $f(x, Z)$, $g_i(x, Z)$ and $h_j(x, Z)$ in the following, where $x := [b, \xi^\top, \beta^\top]^\top$. [sent-110, score-0.135]
32 Substituting (15) into (12)–(14), we have $\min_x f(x, Z)$, (16) subject to $g_i(x, Z) \le 0$, $1 \le i \le N_g$, (17) and $h_j(x, Z) = 0$, $1 \le j \le N_h$, (18) and W(Z) is the optimal value of the above optimization problem. [sent-111, score-0.152]
33 Furthermore, let $\bar{\alpha}_i$ and $\bar{\beta}_j$ be the unique corresponding Lagrange multipliers associated with $g_i$ and $h_j$ respectively, $1 \le i \le N_g$, $1 \le j \le N_h$. [sent-113, score-0.153]
34 Building an SLMC. Having described our direct method for building SKLAs, we will focus on a concrete application of this method in the rest of this paper. [sent-121, score-0.092]
35 1 Objective. Now we begin to consider the binary classification problem, where we are given a set of training data $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ is the input data, and $y_i \in \{-1, 1\}$ is the class label. [sent-126, score-0.154]
36 Our objective is as follows: given a training data set and a positive integer $N_z$, we want to build a classifier such that the number of XVs of the classifier equals $N_z$ and the margin of the classifier is as large as possible.
37 This way we build a large margin classifier whose sparseness is explicitly controlled. [sent-128, score-0.097]
38 Substituting (27) into (24) and (25), we have $\min_{\xi,b,\beta} \frac{1}{2}\beta^\top K_z \beta + C\sum_{i=1}^N \xi_i$, (28) subject to $y_i(\beta^\top \psi_z(x_i) + b) \ge 1 - \xi_i$, $\forall i$, (29) and $\xi_i \ge 0$, $\forall i$, (30) where $\psi_z(x_i) = [K(z_1, x_i), \ldots, K(z_{N_z}, x_i)]^\top$ (31) [sent-158, score-0.09]
39 is the empirical kernel map (Schölkopf and Smola, 2002) and $K_z$ is the kernel matrix of the $z_i$, i.e. $K^z_{ij} = K(z_i, z_j)$. [sent-161, score-0.251]
40 Note that when $N_z = N$ and $z_i = x_i$, $1 \le i \le N$, this is the standard SVM training problem. [sent-164, score-0.131]
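The quantities in (28)–(31) translate directly into a few lines of array code; the sketch below (with a Gaussian kernel and hypothetical helper names) builds the empirical kernel map ψ_z and the kernel matrix K_z of the expansion vectors, and these helpers are reused in the later sketches.

```python
import numpy as np

def gaussian_K(A, B, gamma=0.5):
    # pairwise Gaussian kernel matrix between the rows of A and the rows of B
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def empirical_kernel_map(X, Z, gamma=0.5):
    # rows are psi_z(x_i) = [K(z_1, x_i), ..., K(z_Nz, x_i)], as in (31); shape (N, Nz)
    return gaussian_K(X, Z, gamma)

def kernel_matrix_of_XVs(Z, gamma=0.5):
    # (K_z)_{ij} = K(z_i, z_j); shape (Nz, Nz)
    return gaussian_K(Z, Z, gamma)
```

In these terms, (28)–(30) is a linear SVM over the rows of `empirical_kernel_map(X, Z)` whose regularizer is the $K_z$-weighted norm $\frac{1}{2}\beta^\top K_z\beta$ (since $\|w\|^2 = \beta^\top K_z\beta$ under (11)), not a plain squared norm of $\beta$.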
41 In contrast, the problem (28)–(30) is to train a linear SVM in a subspace spanned by $\phi(z_i)$, $1 \le i \le N_z$, where the $z_i$ are not necessarily training examples.
42 The derivatives of $L(\xi, b, \beta, \alpha, \gamma)$ with respect to the primal variables must vanish: $\frac{\partial L}{\partial \beta} = K_z\beta - \sum_{i=1}^N \alpha_i y_i \psi_z(x_i) = 0$, (33) $\frac{\partial L}{\partial b} = -\sum_{i=1}^N \alpha_i y_i = 0$, (34) and $\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \gamma_i = 0$, $\forall i$. (35) [sent-168, score-0.098]
43 The function $\hat{K}_z(\cdot, \cdot)$ defined by (40) is a positive definite kernel function (Schölkopf and Smola, 2002). [sent-171, score-0.076]
44 Therefore given Z, computing the expansion coefficients of the SVM with kernel function K is equivalent to training an SVM with the modified kernel function $\hat{K}_z$ defined by (40). [sent-176, score-0.076]
45 The map defined in (41) is called the “whitened” empirical kernel map or “kernel PCA map” (Schölkopf and Smola, 2002). [sent-177, score-0.076]
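Equations (40)–(41) themselves are not reproduced in this excerpt; assuming the usual definitions, $\hat{K}_z(x, x') = \psi_z(x)^\top K_z^{-1}\psi_z(x')$ and $\phi_z(x) = K_z^{-1/2}\psi_z(x)$ (the whitened empirical kernel map), a sketch looks like this — treat those exact forms as an assumption.

```python
import numpy as np

def whitened_map(Psi, Kz, jitter=1e-10):
    # phi_z(x) = Kz^{-1/2} psi_z(x): assumed form of the map in (41)
    # Psi is (N, Nz) with rows psi_z(x_i); Kz is the (Nz, Nz) kernel matrix of the XVs
    w, V = np.linalg.eigh(Kz + jitter * np.eye(Kz.shape[0]))
    Kz_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T     # symmetric inverse square root
    return Psi @ Kz_inv_sqrt

def modified_kernel(Psi_a, Psi_b, Kz, jitter=1e-10):
    # hatK_z(x, x') = psi_z(x)^T Kz^{-1} psi_z(x'): assumed form of the kernel in (40)
    Kz_reg = Kz + jitter * np.eye(Kz.shape[0])
    return Psi_a @ np.linalg.solve(Kz_reg, Psi_b.T)
```

Under these assumed forms, training a standard linear SVM on the whitened features $\phi_z(x_i)$ recovers problem (28)–(30), because $\|w\|^2 = \beta^\top K_z\beta$ when $w = \sum_i \phi(z_i)\beta_i$.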
47 So given Z, assuming $\alpha^z_i$, $1 \le i \le N$, are the solution of (37)–(39), then we can compute W(Z) as $W(Z) = \sum_{i=1}^N \alpha^z_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha^z_i \alpha^z_j y_i y_j \hat{K}_z(x_i, x_j)$. (44) [sent-180, score-0.119]
48 According to (36), the expansion coefficients $\beta$ can be calculated as $\beta = (K_z)^{-1}\sum_{i=1}^N \alpha^z_i y_i \psi_z(x_i) = (K_z)^{-1}(K_{zx})Y\alpha^z$, (45) where $\psi_z(\cdot)$ is defined by (31), $K_{zx}$ is the matrix defined by $K^{zx}_{ij} = K(z_i, x_j)$, and Y is a diagonal matrix of class labels, i.e. $Y_{ii} = y_i$. [sent-181, score-0.189]
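Given the dual solution α^z of (37)–(39) for the current Z, (44) and (45) are a couple of lines of linear algebra. The helpers below reuse the matrices from the earlier sketches; the tiny jitter added before solving is an implementation choice of this sketch, not something taken from the paper.

```python
import numpy as np

def W_of_Z(alpha_z, y, K_hat):
    # W(Z) from (44): sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j hatK_z(x_i, x_j)
    a_y = alpha_z * y
    return alpha_z.sum() - 0.5 * a_y @ K_hat @ a_y

def expansion_coefficients(alpha_z, y, Kz, Kzx, jitter=1e-10):
    # beta from (45): beta = Kz^{-1} Kzx Y alpha^z, with Y = diag(y) and (Kzx)_{ij} = K(z_i, x_j)
    rhs = Kzx @ (y * alpha_z)
    return np.linalg.solve(Kz + jitter * np.eye(Kz.shape[0]), rhs)
```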
49 As can be seen in corollary 2, to apply lemma 1 to calculate W(Z), we only need to make two assumptions on problem (37)–(39): the kernel function $\hat{K}_z(\cdot, \cdot)$ is strictly positive definite and the resulting support vectors come from both classes. [sent-192, score-0.095]
50 4 The Kernel Function $\hat{K}_z$ and Its Corresponding Feature Space $\mathcal{F}_z$. The kernel function $\hat{K}_z$ plays an important role in our approach. [sent-210, score-0.076]
51 It is well known that training an SVM with a nonlinear kernel function K in the input space X is equivalent to building a linear SVM in a feature space F. [sent-212, score-0.176]
52 Given the expansion vectors $z_1, \ldots, z_{N_z}$, training an SVM with kernel function K is equivalent to building an SVM with another kernel function $\hat{K}_z$, which is in turn equivalent to constructing a linear SVM in another feature space. [sent-218, score-0.252]
53 Based on this point of view, we can see that our proposed approach essentially finds a subspace Fz where the margin of the training data is maximized. [sent-237, score-0.107]
54 After the expansion vectors $z_1, \ldots, z_{N_z}$ are obtained, the expansion coefficients $\beta$ are computed by minimizing (4), which leads to (Schölkopf and Smola, 2002) $\beta = (K_z)^{-1}(K_{zx})Y\alpha$, (47) where $K_{zx}$ and Y are defined as in (45), and $\alpha$ is the solution of building an SVM with kernel function K on the training data set $\{(x_i, y_i)\}_{i=1}^N$. [sent-244, score-0.293]
55 The only difference is that in (47), $\alpha$ is the solution of training an SVM with kernel function K, while in (45), $\alpha^z$ is the solution of training an SVM with the kernel function $\hat{K}_z$, which takes the XVs $z_i$ into consideration. [sent-247, score-0.291]
56 2 Comparison with RSVM and a Modified RSVM Algorithm. One might argue that our approach appears to be similar to the RSVM, because the RSVM algorithm also restricts the weight vector of the decision hyperplane to be a linear expansion of $N_z$ XVs. [sent-251, score-0.084]
57 The first one (and probably the fundamental one) is that in the RSVM approach, Nz XVs are randomly selected from the training data in advance, but are not computed by finding a discriminating subspace Fz . [sent-253, score-0.108]
58 This step immediately reduces the problem (28)–(30) to a standard linear SVM training problem (Lin and Lin, 2003), where $\beta$ becomes the weight vector of the decision hyperplane and the training set becomes $\{(\psi_z(x_i), y_i)\}_{i=1}^N$.
59 On the other hand, our method of computing $\beta$ is to build a linear SVM in the subspace $\mathcal{F}_z$, which is to train a linear SVM for the training data set $\{(\phi_z(x_i), y_i)\}_{i=1}^N$.
60 Analogous to the modified RS method, we propose a modified RSVM algorithm: Firstly, Nz training data are randomly selected as XVs, then the expansion coefficients β are computed by (45). [sent-264, score-0.1]
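A sketch of this modified RSVM baseline, reusing the helpers from the earlier sketches; scikit-learn's `SVC` with a precomputed kernel is used here only as one convenient solver for the dual (37)–(39), and the packing of its dual coefficients is handled under that assumption.

```python
import numpy as np
from sklearn.svm import SVC

def modified_rsvm(X, y, Nz, gamma=0.5, C=1.0, seed=0):
    # 1) randomly pick Nz training points as the expansion vectors
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=Nz, replace=False)]

    # 2) solve the dual (37)-(39) with the modified kernel hatK_z on the training set
    Kz = kernel_matrix_of_XVs(Z, gamma)
    Kzx = gaussian_K(Z, X, gamma)                                   # (Nz, N)
    K_hat = Kzx.T @ np.linalg.solve(Kz + 1e-10 * np.eye(Nz), Kzx)   # (N, N)
    svm = SVC(C=C, kernel="precomputed").fit(K_hat, y)

    alpha_z = np.zeros(len(X))
    alpha_z[svm.support_] = np.abs(svm.dual_coef_.ravel())          # dual_coef_ stores y_i * alpha_i

    # 3) expansion coefficients via (45)
    beta = expansion_coefficients(alpha_z, y, Kz, Kzx)
    return Z, beta, svm.intercept_[0]
```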
61 3 Comparison with the RVM. The RVM (Tipping, 2001) algorithm and many other sparse learning algorithms, such as sparse greedy algorithms (Nair et al. [sent-266, score-0.176]
62 Consequently, with the same number of XVs, SLMC can have better classification performance than the RVM and other sparse learning algorithms which select the XVs only from the training data. [sent-270, score-0.12]
63 4 SLMC vs Neural Networks. Since the XVs of the SLMC do not necessarily belong to the training set and training an SLMC is a gradient-based process, the SLMC can be thought of as a neural network with weight regularization (Bishop, 1995). [sent-274, score-0.114]
64 To this end, the regularizer takes into account the kernel matrix Kz . [sent-277, score-0.076]
65 It is straightforward to apply this method to build a sparse one-class SVM algorithm (Schölkopf and Smola, 2002), to which there is no obvious neural network counterpart. [sent-281, score-0.129]
66 Note that the prediction time (the number of kernel evaluations) of a soft margin SVM scales linearly with the number of training patterns (Steinwart, 2003). [sent-283, score-0.139]
67 This is also similar to the recent work of Snelson and Ghahramani (2006) on building sparse Gaussian processes, which was done at almost the same time as our previous work (Wu et al. [sent-287, score-0.156]
68 3 Parameter Selection. A Gaussian kernel is used in the experiments: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. [sent-310, score-0.076]
69 RS method: The RS method uses the same kernel parameter as the standard SVM, since it aims to simplify the standard SVM solution. [sent-315, score-0.076]
70 In Table 1, NSV stands for the number of support vectors (SVs) of the standard SVM, Nz represents the number of XVs of other sparse learning algorithms. [sent-342, score-0.107]
71 For each data set, the result of the RVM is shown in boldface if it is the best compared to the other sparse learning algorithms. [sent-446, score-0.088]
72 Table 1 also illustrates that SLMC outperforms the other sparse learning algorithms in most cases. [sent-449, score-0.088]
73 Theoretically, if the Gaussian kernel is used and the $z_i$, $1 \le i \le N_z$, are different from each other, then $K_z$ has full rank and its inverse can be computed without any problems. [sent-496, score-0.151]
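In practice, $K_z$ can still become badly conditioned if two $z_i$ drift close to each other during the gradient updates, so a common implementation safeguard (not something prescribed by the paper) is to avoid forming the explicit inverse:

```python
import numpy as np

def solve_Kz(Kz, B, jitter=1e-8):
    # Solve Kz @ X = B without explicitly forming inv(Kz).
    n = Kz.shape[0]
    try:
        L = np.linalg.cholesky(Kz + jitter * np.eye(n))   # works when Kz (+ jitter) is positive definite
        return np.linalg.solve(L.T, np.linalg.solve(L, B))
    except np.linalg.LinAlgError:
        return np.linalg.pinv(Kz) @ B                     # fall back if Kz is numerically singular
```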
74 8 Training Time of SLMC. Building an SLMC is a gradient-based process, where each iteration consists of computing the $\phi_z(x_i)$, $1 \le i \le N$, training a linear SVM over $\{(\phi_z(x_i), y_i)\}_{i=1}^N$, and then computing the gradient of W(Z). [sent-520, score-0.149]
75 Let $T_{SVM}(N, d)$ denote the time complexity of training a linear SVM over a data set containing N d-dimensional vectors; then the time complexity of training an SLMC is $O(n \times (N N_z d + T_{SVM}(N, N_z) + N_z^3 + N_z^2 d))$, where n is the number of iterations of the SLMC training process. [sent-521, score-0.096]
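The per-iteration structure described above can be sketched as the loop below, which reuses the earlier helpers (`gaussian_K`, `kernel_matrix_of_XVs`, `W_of_Z`). The paper uses an analytic gradient of W(Z); the finite-difference gradient and fixed step size here are stand-ins chosen only to keep the sketch self-contained, and they are much slower than the quoted complexity.

```python
import numpy as np
from sklearn.svm import SVC

def W_and_alpha(X, y, Z, gamma, C):
    # inner problem: solve (37)-(39) for fixed Z and return (W(Z), alpha^z)
    Kz = kernel_matrix_of_XVs(Z, gamma) + 1e-10 * np.eye(len(Z))
    Kzx = gaussian_K(Z, X, gamma)
    K_hat = Kzx.T @ np.linalg.solve(Kz, Kzx)
    svm = SVC(C=C, kernel="precomputed").fit(K_hat, y)
    alpha = np.zeros(len(X))
    alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
    return W_of_Z(alpha, y, K_hat), alpha

def train_slmc_Z(X, y, Z0, gamma=0.5, C=1.0, step=0.1, n_iter=10, eps=1e-4):
    # outer loop: gradient descent of W(Z) over the expansion vectors Z (shape (Nz, d))
    Z = Z0.astype(float).copy()
    for _ in range(n_iter):
        G = np.zeros_like(Z)
        for u in range(Z.shape[0]):
            for v in range(Z.shape[1]):
                Zp, Zm = Z.copy(), Z.copy()
                Zp[u, v] += eps
                Zm[u, v] -= eps
                G[u, v] = (W_and_alpha(X, y, Zp, gamma, C)[0]
                           - W_and_alpha(X, y, Zm, gamma, C)[0]) / (2.0 * eps)
        Z -= step * G          # descend on W(Z), since problem (8)-(11) is a minimization
    return Z
```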
76 For applications where processing speed is important, such as real time computer vision, sparse KMs can be of great value. [sent-531, score-0.088]
77 Furthermore, some kernel learning algorithms are not sparse at all, such as kernel ridge regression, KFD, KPCA, which means that all the training data need to be saved as the XVs in the resulting KMs trained by these algorithms. [sent-533, score-0.272]
78 Hence building sparse versions of these algorithms can not only accelerate the evaluation of the test data, but also dramatically reduce the storage required for the resulting KMs.
79 Equivalently we can build an SVM with the kernel function $\hat{K}_z$ over $\{(x_i, y_i)\}_{i=1}^N$. [sent-535, score-0.166]
80 Conclusions. We present a direct method to build sparse kernel learning algorithms. [sent-540, score-0.229]
81 There are mainly two advantages of this method: First, it can be applied to sparsify many current kernel learning algorithms. [sent-541, score-0.094]
82 Based on this method we propose an approach to build sparse large margin classifiers, which essentially finds a discriminating subspace Fz of the feature space F. [sent-543, score-0.236]
83 Experimental results indicate that this approach often exhibits a better classification accuracy, at comparable sparsity, than the sparse learning algorithms to which we compared. [sent-544, score-0.088]
84 Possible future work may include applying the proposed method to other kernel learning algorithms and running the direct method greedily to find the XVs one after another in order to accelerate the training procedure. [sent-549, score-0.132]
85 Sparse KFD. We have derived the SLMC algorithm as an example of our direct sparse learning approach. [sent-552, score-0.112]
86 Here we will show another example to build sparse KFD (Mika et al. [sent-553, score-0.129]
87 Following our proposed approach, in order to build a sparse KFD (SKFD), we have to solve the following problem: $\min_{w \in \mathcal{F},\, \beta \in \mathbb{R}^{N_z},\, Z \in \mathbb{R}^{d \times N_z}} -w^\top A w$, (52) subject to $w^\top \hat{B} w = 1$, (53) and $w = \sum_{i=1}^{N_z} \phi(z_i)\beta_i$. (54) [sent-566, score-0.146]
88 Therefore, according to lemma 1, the derivative of W(Z) with respect to $z_{uv}$, which denotes the v-th component of vector $z_u$, $1 \le u \le N_z$, $1 \le v \le d$, exists and can be computed as $\frac{\partial W(Z)}{\partial z_{uv}} = -\bar{\beta}^\top \frac{\partial A_z}{\partial z_{uv}} \bar{\beta} + \bar{\lambda}\, \bar{\beta}^\top \frac{\partial B_z}{\partial z_{uv}} \bar{\beta}$. [sent-587, score-0.123]
89 After obtaining the XVs zi by solving the problem (52)–(54), we can take the solution βi of this problem as the expansion coefficients. [sent-589, score-0.143]
90 As mentioned before, having the $z_i$, we can also apply our proposed method, in which the expansion coefficients and the bias are calculated by solving the problem (37)–(39). [sent-592, score-0.163]
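To ground the SKFD discussion, here is one plausible instantiation of the inner solve of (52)–(54) with Z held fixed, using the conventional Fisher between-class and within-class scatter of the mapped data ψ_z(x_i); the paper's exact A and B̂ are not reproduced in this excerpt, so both the scatter definitions and the small regularizer are assumptions of this sketch.

```python
import numpy as np

def skfd_inner(X, y, Z, gamma=0.5, reg=1e-3):
    # Fisher discriminant restricted to the span of {phi(z_i)}, parametrized by beta.
    Psi = empirical_kernel_map(X, Z, gamma)          # (N, Nz), rows psi_z(x_i)
    Psi_p, Psi_m = Psi[y == 1], Psi[y == -1]
    d = Psi_p.mean(axis=0) - Psi_m.mean(axis=0)      # difference of class means

    # within-class scatter, regularized so that it is invertible
    Cp = Psi_p - Psi_p.mean(axis=0)
    Cm = Psi_m - Psi_m.mean(axis=0)
    B = Cp.T @ Cp + Cm.T @ Cm + reg * np.eye(Z.shape[0])

    beta = np.linalg.solve(B, d)                     # maximizes (beta^T d)^2 / (beta^T B beta)
    return beta / np.sqrt(beta @ B @ beta)           # scale so that beta^T B beta = 1, cf. (53)
```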
91 The best results among the sparse learning algorithms are in boldface. [sent-644, score-0.088]
92 For the KFD algorithm, the number of expansion vectors Nz is the same as the number of training data since KFD is not sparse at all. [sent-646, score-0.207]
93 Furthermore, the original KFD algorithm is not sparse at all, i.e. [sent-648, score-0.088]
94 Formal Definitions of the Two Conditions in Lemma 1. For each $Z \in \mathbb{R}^{d \times N_z}$, let S(Z) denote the feasible set of problem (16)–(18): $S(Z) = \{x \mid g_i(x, Z) \le 0,\ 1 \le i \le N_g\} \cap \{x \mid h_j(x, Z) = 0,\ 1 \le j \le N_h\}$. [sent-658, score-0.152]
95 There exists a vector v such that $v^\top \nabla_x h_j(x, Z)|_{x=\bar{x}} = 0$, $1 \le j \le N_h$, (65) and the gradients $\{\nabla_x h_j(x, Z)|_{x=\bar{x}},\ 1 \le j \le N_h\}$ are linearly independent; here $\nabla_x$ denotes the gradient with respect to the variables x. [sent-662, score-0.256]
96 First, for the soft margin SVM training problem (37)–(39), it is known that if the kernel function $\hat{K}_z(\cdot, \cdot)$ is strictly positive definite then the problem has a unique solution and the corresponding Lagrange multipliers are also unique. [sent-665, score-0.157]
97 In this case, the component $v_{s_3}$ can be easily computed as $v_{s_3} = -\frac{y_{s_3}}{y_{s_3}^2} \sum_{i \in S_1 \cup S_2} v_i y_i$, where $y = [y_1, \ldots, y_N]^\top$. [sent-682, score-0.081]
98 Therefore the left side of equation (65) equals $v^\top y = \sum_{i \in S_1} v_i y_i + \sum_{i \in S_2} v_i y_i = \sum_{i \in S_1} v_i y_i + \sum_{i \in S_2,\, y_i > 0} v_i - \sum_{i \in S_2,\, y_i < 0} v_i$. (66) [sent-689, score-0.535]
99 Recall that we construct v such that $v_i < 0$ for $i \in S_2$. [sent-690, score-0.081]
100 Some greedy learning algorithms for sparse regression and classification with Mercer kernels. [sent-787, score-0.088]
wordName wordTfidf (topN-words)
[('nz', 0.511), ('slmc', 0.388), ('kz', 0.379), ('xvs', 0.37), ('rsvm', 0.147), ('coe', 0.135), ('rs', 0.134), ('zuv', 0.123), ('bak', 0.115), ('kfd', 0.112), ('rvm', 0.112), ('svm', 0.102), ('znz', 0.097), ('sparse', 0.088), ('hj', 0.087), ('az', 0.082), ('mrsvm', 0.079), ('kernel', 0.076), ('nh', 0.075), ('zi', 0.075), ('km', 0.072), ('mrs', 0.07), ('nxv', 0.07), ('rnz', 0.07), ('building', 0.068), ('expansion', 0.068), ('bz', 0.068), ('fz', 0.068), ('wu', 0.064), ('erent', 0.064), ('klas', 0.062), ('skfd', 0.062), ('uz', 0.06), ('scholkopf', 0.059), ('di', 0.055), ('xj', 0.052), ('yi', 0.049), ('gi', 0.048), ('kla', 0.044), ('kzx', 0.044), ('subspace', 0.044), ('banana', 0.043), ('build', 0.041), ('nsv', 0.037), ('usps', 0.037), ('gradient', 0.034), ('lkopf', 0.034), ('vi', 0.032), ('discriminating', 0.032), ('training', 0.032), ('margin', 0.031), ('mika', 0.03), ('ng', 0.03), ('sch', 0.029), ('tipping', 0.029), ('zj', 0.027), ('gauvin', 0.026), ('kms', 0.026), ('sklas', 0.026), ('classi', 0.026), ('sparseness', 0.025), ('smola', 0.025), ('titanic', 0.025), ('direct', 0.024), ('xi', 0.024), ('lagrange', 0.023), ('svs', 0.022), ('rd', 0.022), ('german', 0.021), ('bw', 0.02), ('kpca', 0.02), ('cients', 0.02), ('calculated', 0.02), ('compact', 0.019), ('breast', 0.019), ('waveform', 0.019), ('vectors', 0.019), ('lin', 0.019), ('benchmarks', 0.018), ('erence', 0.018), ('yj', 0.018), ('regularity', 0.018), ('relevance', 0.018), ('dubeau', 0.018), ('erentiable', 0.018), ('mingrui', 0.018), ('nair', 0.018), ('snelson', 0.018), ('sparsify', 0.018), ('tsv', 0.018), ('multipliers', 0.018), ('subject', 0.017), ('feasible', 0.017), ('aw', 0.017), ('suykens', 0.017), ('cancer', 0.017), ('weight', 0.016), ('duplicated', 0.015), ('erences', 0.015), ('khan', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
2 0.071299292 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
3 0.068224043 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
4 0.068195499 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
7 0.04844559 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
8 0.046623401 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
10 0.038473669 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research (Special Topic on Machine Learning and Optimization)
13 0.036335133 44 jmlr-2006-Large Scale Transductive SVMs
14 0.035653934 93 jmlr-2006-Universal Kernels
16 0.034065124 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
17 0.033845387 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
18 0.03324255 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
19 0.029898595 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
topicId topicWeight
[(0, 0.155), (1, -0.097), (2, 0.104), (3, -0.053), (4, 0.114), (5, -0.086), (6, 0.006), (7, -0.019), (8, 0.049), (9, -0.053), (10, 0.07), (11, -0.036), (12, 0.044), (13, -0.035), (14, 0.04), (15, -0.005), (16, 0.037), (17, 0.003), (18, -0.023), (19, -0.042), (20, 0.171), (21, -0.058), (22, -0.194), (23, 0.004), (24, 0.083), (25, -0.002), (26, -0.063), (27, -0.252), (28, 0.076), (29, 0.13), (30, 0.079), (31, -0.012), (32, -0.146), (33, -0.125), (34, -0.173), (35, -0.064), (36, 0.182), (37, 0.03), (38, 0.151), (39, 0.11), (40, -0.229), (41, -0.082), (42, -0.132), (43, -0.058), (44, -0.173), (45, -0.015), (46, 0.206), (47, 0.162), (48, 0.108), (49, -0.215)]
simIndex simValue paperId paperTitle
same-paper 1 0.8987174 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
2 0.46189052 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax ) to approximate the SVM primal cost function well; (3) it is 2 efficient and roughly scales as O(ndmax ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
4 0.28775811 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming
Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor
Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
7 0.23098435 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
9 0.20458177 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming (Special Topic on Machine Learning and Optimization)
10 0.20270655 45 jmlr-2006-Learning Coordinate Covariances via Gradients
12 0.18541811 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.17170912 69 jmlr-2006-One-Class Novelty Detection for Seizure Analysis from Intracranial EEG
14 0.17049161 93 jmlr-2006-Universal Kernels
15 0.1604635 44 jmlr-2006-Large Scale Transductive SVMs
16 0.16010417 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
17 0.15258145 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
18 0.1438136 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
19 0.14227626 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
topicId topicWeight
[(8, 0.056), (35, 0.013), (36, 0.05), (45, 0.012), (46, 0.015), (50, 0.033), (53, 0.386), (61, 0.013), (63, 0.038), (71, 0.013), (76, 0.016), (78, 0.043), (81, 0.017), (84, 0.023), (90, 0.06), (91, 0.046), (96, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.65400165 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
2 0.62353987 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
Author: Pieter Abbeel, Daphne Koller, Andrew Y. Ng
Abstract: We study the computational and sample complexity of parameter and structure learning in graphical models. Our main result shows that the class of factor graphs with bounded degree can be learned in polynomial time and from a polynomial number of training examples, assuming that the data is generated by a network in this class. This result covers both parameter estimation for a known network structure and structure learning. It implies as a corollary that we can learn factor graphs for both Bayesian networks and Markov networks of bounded degree, in polynomial time and sample complexity. Importantly, unlike standard maximum likelihood estimation algorithms, our method does not require inference in the underlying network, and so applies to networks where inference is intractable. We also show that the error of our learned model degrades gracefully when the generating distribution is not a member of the target class of networks. In addition to our main result, we show that the sample complexity of parameter learning in graphical models has an O(1) dependence on the number of variables in the model when using the KL-divergence normalized by the number of variables as the performance criterion.1 Keywords: probabilistic graphical models, parameter and structure learning, factor graphs, Markov networks, Bayesian networks
3 0.31530565 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
4 0.30946344 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memoryefficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.1 Keywords: Markov networks, large-margin methods, structured prediction, extragradient, Bregman projections
7 0.30231628 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
9 0.30166769 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
10 0.30163127 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
11 0.30042681 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
12 0.29781502 44 jmlr-2006-Large Scale Transductive SVMs
13 0.29652423 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
14 0.29529604 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs (Special Topic on Machine Learning and Optimization)
15 0.29442731 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
17 0.29188934 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs (Special Topic on Machine Learning and Optimization)
18 0.29084799 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix