nips nips2002 nips2002-106 knowledge-graph by maker-knowledge-mining

106 nips-2002-Hyperkernels


Source: pdf

Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola

Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. [sent-8, score-0.426]

2 A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. [sent-9, score-0.202]

3 By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e. [sent-10, score-0.751]

4 of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. [sent-12, score-0.333]

5 Various classical settings for model or kernel selection are special cases of our framework. [sent-13, score-0.323]

6 1 Introduction Choosing suitable kernel functions for estimation using Gaussian Processes and Support Vector Machines is an important step in the inference process. [sent-14, score-0.373]

7 To date, there are few if any systematic techniques to assist in this choice. [sent-15, score-0.042]

8 Even the restricted problem of choosing the “width” of a parameterized family of kernels (e. [sent-16, score-0.287]

9 The restriction mentioned is that the methods work with the kernel matrix, rather than the kernel itself. [sent-20, score-0.646]

10 Furthermore, whilst demonstrably improving the performance of estimators to some degree, these methods require clever parameterization and design to work in particular situations. [sent-21, score-0.102]

11 There are still no general principles to guide the choice of a) which family of kernels to choose, b) efficient parameterizations over this space, and c) suitable penalty terms to combat overfitting. [sent-22, score-0.238]

12 Whilst not yet providing a complete solution to these problems, this paper presents a framework that allows optimization within a parameterized family relatively simply and, crucially, intrinsically captures the tradeoff between the size of the family of kernels and the sample size available. [sent-24, score-0.321]

13 Furthermore, the solution presented is for optimizing kernels themselves, rather than the kernel matrix as in [1]. [sent-25, score-0.542]

14 Other approaches to learning the kernel include boosting [5] and bounding the Rademacher complexity [6]. [sent-26, score-0.362]

15 We give several examples of hyperkernels (Section 4) and show (Section 5) how they can be used practically. [sent-28, score-0.214]

16 Due to space constraints we only consider Support Vector classification. [sent-29, score-0.026]

17 Let X denote the set of training data and Y the set of corresponding labels, jointly drawn iid from some probability distribution on X x Y. [sent-35, score-0.065]

18 Furthermore, let the corresponding test sets be drawn from the same distribution. [sent-36, score-0.065]

19 We introduce a new class of functionals on data which we call quality functionals. [sent-38, score-0.412]

20 Their purpose is to indicate, given a kernel and the training data , how suitable the kernel is for explaining the training data. [sent-39, score-0.754]

21 if there exists a function f such that [condition lost in extraction], where K is the kernel matrix. [sent-42, score-0.351]

22 Given a sufficiently rich class of kernels it is in general possible to find a kernel that attains an arbitrarily small value of the quality functional for any training set. [sent-44, score-0.537]

23 Analogously to the standard methods of statistical learning theory, we aim to minimize the expected quality functional: Q_emp(k, X, Y) is an empirical quality functional, and (1) Q(k) = E[Q_emp(k, X, Y)] is the expected quality functional, where the expectation is taken with respect to the distribution generating the data. [sent-46, score-0.826]

24 We now present some examples of quality functionals, and derive their exact minimizers whenever possible. [sent-49, score-0.287]

25 Example 1 (Kernel Target Alignment) This quality functional was introduced in [7] to assess the “alignment” of a kernel with training labels. [sent-50, score-0.779]

26 It is defined by (3) A(K, y) = (y^T K y) / (m ||K||_F), i.e. the Frobenius inner product between K and yy^T, normalized by the norms of the two matrices. [sent-51, score-0.282]

27 where y denotes the vector of elements of Y, ||y|| denotes the norm of y, and ||K||_F is the Frobenius norm: ||K||_F^2 = sum_{i,j} K_{ij}^2. [sent-52, score-0.16]

28 Note that the definition in [7] looks somewhat different, yet it is algebraically identical to (3). [sent-53, score-0.024]
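As a concrete check of the alignment score (3), here is a minimal numeric sketch. The RBF kernel, the toy two-cluster data, and all parameter values are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def alignment(K, y):
    # A(K, y) = <K, yy^T>_F / (||K||_F * ||yy^T||_F)
    #         = y^T K y / (m * ||K||_F)   since ||yy^T||_F = m for y in {-1,+1}^m
    m = len(y)
    return float(y @ K @ y) / (m * np.linalg.norm(K, "fro"))

# Two well-separated clusters labelled -1 / +1: the kernel should align well.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
A_val = alignment(rbf_kernel(X, gamma=0.5), y)
print(A_val)  # a value in (0, 1]; high for well-clustered data
```

The score is bounded by 1 (Cauchy-Schwarz) and is large when the Gram matrix has the same block structure as the labels.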

29 [Equation (4) lost in extraction.] It is clear that one cannot expect the resulting minimizer to do well on data other than the set chosen to determine it. [sent-56, score-0.504]

30 (5) R_reg(f, X, Y) = (1/m) sum_{i=1}^m l(x_i, y_i, f(x_i)) + (lambda/2) ||f||_H^2, where ||f||_H is the RKHS norm of f. [sent-58, score-0.066]

31 For suitable loss functions (cf. [4, 8]) we know that the minimizer over f of (5) can be written as a kernel expansion. [sent-61, score-0.463]

32 For a given loss l this leads to the quality functional (6) Q_emp(k, X, Y) = min_{f in H_k} R_reg(f, X, Y). The minimizer of (6) is more difficult to find, since we have to carry out a double minimization over k and f. [sent-62, score-0.697]
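For the squared loss, the inner minimization over f (for a fixed kernel) has a closed form, which makes the double minimization concrete. This is a hedged sketch under our own convention for the objective, (1/m)||K a - y||^2 + lam * a^T K a, whose minimizer is a = (K + lam*m*I)^{-1} y (kernel ridge regression); the data and parameter values are illustrative:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Cross Gram matrix between rows of X and Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_regularized_risk(K, y, lam):
    # Closed-form minimizer of (1/m)||K a - y||^2 + lam * a^T K a:
    # setting the gradient to zero gives (K + lam*m*I) a = y.
    m = len(y)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (30, 1))
y = np.sin(3 * X[:, 0])
K = rbf_kernel(X, X, gamma=2.0)
alpha = fit_regularized_risk(K, y, lam=1e-4)
pred = K @ alpha
print(np.mean((pred - y) ** 2))  # small training error for a smooth target
```

The outer minimization over the kernel itself is exactly what the rest of the paper addresses.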

33 For a sufficiently large scaling of the kernel, we can make the quality functional arbitrarily close to its minimal value. [sent-65, score-0.033]

34 Even if we disallow setting it to zero, we can determine the minimum of (6) as follows. [sent-66, score-0.044]

35 Then, choosing each value appropriately yields the minimum with respect to the remaining variable. [sent-68, score-0.044]

36 The proof that this is the global minimizer of the quality functional is omitted for brevity. [sent-69, score-0.567]

37 Example 3 (Negative Log-Posterior) In Gaussian processes, this functional is similar to the regularized risk, since it includes a regularization term (the negative log prior) and a loss term (the negative log-likelihood). [sent-70, score-0.275]

38 In addition, it also includes the log-determinant of the kernel matrix, which measures the size of the space spanned by it. [sent-71, score-0.026]

39 The quality functional is (7) Q(k, X, Y) = (1/2) y^T K^{-1} y + (1/2) log det K + (m/2) log 2 pi. [sent-72, score-0.427]
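A hedged numeric sketch of a negative-log-posterior-style quality functional in its GP-regression form (a data-fit term, a log-determinant measuring the volume the kernel spans, and a constant). The noise jitter, the toy data, and the candidate widths are our own assumptions; a kernel width matched to the data scores lower than a badly mismatched one:

```python
import numpy as np

def rbf_gram(X, gamma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def neg_log_posterior(K, y, noise=1e-2):
    # 0.5 * y^T Ky^{-1} y + 0.5 * log det Ky + (m/2) log 2*pi,
    # with Ky = K + noise*I, computed stably via a Cholesky factor.
    m = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(m))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ a + np.log(np.diag(L)).sum() + 0.5 * m * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * X[:, 0])
good = neg_log_posterior(rbf_gram(X, gamma=2.0), y)   # reasonable width
bad = neg_log_posterior(rbf_gram(X, gamma=1e-4), y)   # nearly constant kernel
print(good < bad)  # the better-matched width attains a lower quality value
```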

40 When we fix the scale of the kernel, to exclude the above case, we can set it as in (8) [equation lost in extraction]. [sent-74, score-0.081]

41 Under the assumption that the minimum of (8) is attained, we can see that this still leads to the overall minimum. [sent-75, score-0.044]

42 Other examples, such as cross-validation, leave-one-out estimators, the Luckiness framework, and the Radius-Margin bound, also have empirical quality functionals which can be arbitrarily minimized. [sent-76, score-0.529]

43 The above examples illustrate how many existing methods for assessing the quality of a kernel fit within the quality functional framework. [sent-77, score-1.037]

44 We also saw that, given a rich enough class of kernels, optimization of the empirical quality functional over that class would result in a kernel that would be useless for prediction purposes. [sent-78, score-0.526]

45 This is yet another example of the danger of optimizing too much — there is (still) no free lunch. [sent-79, score-0.067]

46 3 A Hyper Reproducing Kernel Hilbert Space We now introduce a method for optimizing quality functionals in an effective way. [sent-80, score-0.479]

47 The method we propose involves the introduction of a Reproducing Kernel Hilbert Space on the kernel itself — a “Hyper”-RKHS. [sent-81, score-0.347]

48 Definition 3 (Reproducing Kernel Hilbert Space) Let X be a nonempty set (often called the index set) and denote by H a Hilbert space of functions f : X -> R. [sent-85, score-0.098]

49 Then H is called a reproducing kernel Hilbert space endowed with the dot product <., .> (and the norm ||f|| := sqrt(<f, f>)) if there exists a function k : X x X -> R satisfying, for all x in X: 1. [sent-86, score-0.672]

50 k has the reproducing property <f, k(x, .)> = f(x) for all f in H; in particular, <k(x, .), k(x', .)> = k(x, x'). [sent-87, score-0.175]

51 The advantage of optimization in an RKHS is that, under certain conditions, the optimal solutions can be found as the linear combination of a finite number of basis functions, regardless of the dimensionality of the space, as can be seen in the theorem below. [sent-92, score-0.143]

52 Theorem 4 (Representer Theorem) Denote by X a set, by Omega a strictly monotonic increasing function, and by l an arbitrary loss function. [sent-96, score-0.04]

53 Then each minimizer f in H of the regularized risk (1/m) sum_{i=1}^m l(x_i, y_i, f(x_i)) + Omega(||f||_H) admits a representation of the form f(x) = sum_{i=1}^m alpha_i k(x_i, x). Definition 5 (Hyper Reproducing Kernel Hilbert Space) Let X be a nonempty set and let Xbar := X x X (the compounded index set). [sent-97, score-0.538]

54 Then the Hilbert space Hbar of functions k : Xbar -> R, endowed with a dot product <., .> (and the norm ||k|| = sqrt(<k, k>)), is called a Hyper Reproducing Kernel Hilbert Space if there exists a hyperkernel kbar : Xbar x Xbar -> R with the following properties: 1. [sent-98, score-0.471]

55 kbar has the reproducing property <k, kbar(xbar, .)> = k(xbar) for all k in Hbar; in particular, <kbar(xbar, .), kbar(xbar', .)> = kbar(xbar, xbar'). [sent-99, score-0.175]

56 the hyperkernel kbar is a kernel in its second argument, i.e., for any fixed xbar in Xbar, the function k(x, x') := kbar(xbar, (x, x')) is a kernel. [sent-104, score-0.62]

57 On the other hand, it allows for simple optimization algorithms which consider kernels lying in the convex cone of kbar. [sent-110, score-0.237]

58 Analogously to the definition of the regularized risk functional (5), we define the regularized quality functional: (10) Q_reg(k, X, Y) = Q_emp(k, X, Y) + (lambda_Q / 2) ||k||_Hbar^2, where lambda_Q > 0 is a regularization constant and ||k||_Hbar denotes the RKHS norm in Hbar. [sent-111, score-1.204]

59 Minimization of the regularized quality functional is less prone to overfitting than minimization of the empirical one, since the regularization term effectively controls the complexity of the class of kernels under consideration. [sent-112, score-0.216]

60 The question arising immediately from (10) is how to minimize the regularized quality functional efficiently. [sent-114, score-0.647]

61 In the following we show that the minimum can be found as a linear combination of hyperkernels. [sent-115, score-0.044]

62 Corollary 6 (Representer Theorem for Hyper-RKHS) Let Hbar be a hyper-RKHS, denote by Omega a strictly monotonic increasing function, by X a set, and by Q_emp an arbitrary quality functional. [sent-117, score-0.33]

63 Then each minimizer k in Hbar of the regularized quality functional (11) Q_emp(k, X, Y) + Omega(||k||_Hbar) admits a representation of the form k(x, x') = sum_{i,j=1}^m beta_{ij} kbar((x_i, x_j), (x, x')). [sent-118, score-0.827]

64 Then Q_emp has the properties of a loss function, as it only depends on k via its values at the training data. [sent-121, score-0.046]

65 Furthermore, ||k||_Hbar^2 is an RKHS regularizer, so the representer theorem applies and the expansion of k follows. [sent-122, score-0.307]

66 This result shows that even though we are optimizing over an entire (potentially infinite dimensional) Hilbert space of kernels, we are able to find the optimal solution by choosing among a finite dimensional subspace. [sent-123, score-0.146]

67 The dimension required (m^2 coefficients) is, not surprisingly, significantly larger than the number of kernels required in a kernel function expansion, which makes a direct approach possible only for small problems. [sent-124, score-0.575]
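To make the m^2 coefficients concrete, here is a small sketch of the finite expansion k(x, x') = sum_{i,j} beta[i,j] * kbar((x_i, x_j), (x, x')). Our reading of the harmonic hyperkernel, the RBF base kernel, and the value lam=0.6 are illustrative assumptions (lam is a free parameter in (0,1)):

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2)))

def harmonic_hyperkernel(xbar, xbarp, lam=0.6, gamma=1.0):
    # kbar((x,x'),(x'',x''')) = (1 - lam) / (1 - lam * k(x,x') * k(x'',x'''))
    # with k an RBF base kernel taking values in (0, 1].
    return (1 - lam) / (1 - lam * rbf(*xbar, gamma=gamma) * rbf(*xbarp, gamma=gamma))

def expanded_kernel(beta, Xtrain, x, xp):
    # Representer-style expansion with m^2 coefficients beta[i, j]:
    # k(x, x') = sum_{i,j} beta[i,j] * kbar((x_i, x_j), (x, x'))
    m = len(Xtrain)
    return sum(beta[i, j] * harmonic_hyperkernel((Xtrain[i], Xtrain[j]), (x, xp))
               for i in range(m) for j in range(m))

Xtrain = np.array([[0.0], [0.5], [1.0]])
beta = np.full((3, 3), 1.0 / 9)            # m = 3, hence m^2 = 9 coefficients
v1 = expanded_kernel(beta, Xtrain, np.array([0.2]), np.array([0.4]))
v2 = expanded_kernel(beta, Xtrain, np.array([0.4]), np.array([0.2]))
print(v1, v2)  # symmetric in (x, x') since the base kernel is symmetric
```

Even this toy case shows why sparse expansion techniques matter: the coefficient count grows quadratically with the sample size.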

68 However, sparse expansion techniques, such as [9, 8], can be used to make the problem tractable in practice. [sent-125, score-0.1]

69 Example 4 (Power Series Construction) Denote by k a positive semidefinite kernel, and by g(xi) = sum_{i=0}^infinity c_i xi^i a function with positive Taylor expansion coefficients c_i >= 0 and convergence radius R. [sent-128, score-0.1]

70 Then (12) kbar(xbar, xbar') := g(k(xbar) k(xbar')) = sum_{i=0}^infinity c_i (k(xbar) k(xbar'))^i is a hyperkernel: for any fixed xbar it is a kernel itself in its second argument (each summand is a nonnegative multiple of a power of a kernel, hence a kernel), provided k(xbar) k(xbar') < R. Example 5 (Harmonic Hyperkernel) A special case of (12) is the harmonic hyperkernel: Denote by k a kernel with range contained in [0, 1] (e. [sent-133, score-1.038]

71 g., RBF kernels satisfy this property), and set c_i = (1 - lambda_h) lambda_h^i for some 0 < lambda_h < 1. [sent-135, score-0.152]

72 This yields (13) kbar(xbar, xbar') = (1 - lambda_h) sum_{i=0}^infinity (lambda_h k(xbar) k(xbar'))^i = (1 - lambda_h) / (1 - lambda_h k(xbar) k(xbar')); for lambda_h -> 1 the expression converges to the Frobenius norm case [detail lost in extraction]. Example 6 (Gaussian Harmonic Hyperkernel) Equation (14) instantiates (13) with a Gaussian RBF kernel. We can find further hyperkernels, simply by consulting tables on power series of functions. [sent-138, score-0.118]

73 Recall that expansions such as (12) were mainly chosen for computational convenience, in particular whenever it is not clear which particular class of kernels would be useful for the expansion. [sent-140, score-0.152]
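A quick numeric consistency check that the geometric power series with harmonic coefficients c_i = (1 - lam) * lam^i sums to the closed harmonic form; the particular values of lam and of the two base-kernel evaluations (k1, k2, standing for k(xbar) and k(xbar') in [0, 1]) are arbitrary:

```python
# Power-series construction with harmonic coefficients versus its closed form.
lam, k1, k2 = 0.5, 0.8, 0.6

# Truncated series: sum_i (1-lam) * lam^i * (k1*k2)^i
series = sum((1 - lam) * lam**i * (k1 * k2) ** i for i in range(200))

# Closed form of the geometric series: (1-lam) / (1 - lam*k1*k2)
closed = (1 - lam) / (1 - lam * k1 * k2)
print(abs(series - closed))  # negligible truncation error
```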

74 Table 1: Examples of Hyperkernels [table entries lost in extraction] Example 7 (Explicit Construction) If we know or have a reasonable guess as to which kernels could be potentially relevant (e. [sent-154, score-0.212]

75 g., a range of scales of kernel width, polynomial degrees, etc. [sent-156, score-0.323]
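One hedged reading of such an explicit construction: treat a finite list of guessed candidate kernels as an explicit feature map on pairs, so that kbar(xbar, xbar') = sum_i k_i(xbar) * k_i(xbar') is positive semidefinite by construction. The candidate widths and helper names below are our own illustration, not the paper's equation (15):

```python
import numpy as np

def make_rbf(gamma):
    # Candidate base kernel at one width; returns k(x, x') as a float.
    return lambda x, xp: float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2)))

def explicit_hyperkernel(candidates, xbar, xbarp):
    # kbar(xbar, xbar') = sum_i k_i(xbar) * k_i(xbar'): the feature map
    # xbar -> (k_1(xbar), ..., k_n(xbar)) makes this psd automatically.
    f = np.array([k(*xbar) for k in candidates])
    fp = np.array([k(*xbarp) for k in candidates])
    return float(f @ fp)

candidates = [make_rbf(g) for g in (0.1, 1.0, 10.0)]   # several guessed widths
a = explicit_hyperkernel(candidates, (np.zeros(2), np.ones(2)),
                         (np.zeros(2), 0.5 * np.ones(2)))
b = explicit_hyperkernel(candidates, (np.zeros(2), 0.5 * np.ones(2)),
                         (np.zeros(2), np.ones(2)))
print(a, b)  # symmetric: kbar(xbar, xbar') = kbar(xbar', xbar)
```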

76 Given a convex loss function, the regularized quality functional (16) is convex in the kernel. [sent-166, score-0.761]

77 However, this minimum depends on the kernel, and for efficient minimization we would like to compute the derivatives with respect to it. [sent-168, score-0.128]

78 The following lemma tells us how (it is an extension of a result in [3] and we omit the proof for brevity): Lemma 7 Let two convex functions be given, the second parameterized by a parameter vector. [sent-169, score-0.116]

79 Let the minimum of the following optimization problem be attained (and denote the corresponding point its minimizer): [problem statement lost in extraction]. [sent-170, score-0.131]

80 [Derivative identity lost in extraction: the gradient of the minimum value is obtained through the second argument of the objective.] [sent-171, score-0.031]

81 4 Low Rank Approximation While being finite in the number of parameters (despite the optimization over two possibly infinite dimensional Hilbert spaces), (19) still presents a formidable optimization problem in practice (we have m^2 coefficients for the expansion of k). [sent-173, score-0.102]

82 For an explicit expansion of type (15) we can optimize over the expansion coefficients of k directly, which means that we simply have a quality functional with an l2 penalty on the expansion coefficients. [sent-174, score-0.1]
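A minimal sketch of optimizing expansion coefficients directly under a quadratic penalty. The penalized objective J(beta) = -(1/m^2) y^T K(beta) y + (lam/2)||beta||^2 and its closed-form minimizer beta_p = (y^T K_p y) / (lam * m^2) are our own simplification in the spirit of the text, not the paper's formulation; data and widths are illustrative:

```python
import numpy as np

def rbf_gram(X, gamma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# K(beta) = sum_p beta_p * K_p over a few candidate Gram matrices.
# Minimizing J(beta) above in closed form weights each candidate by how
# well it matches the label structure, with the l2 penalty bounding beta.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.2, (15, 2)), rng.normal(1, 0.2, (15, 2))])
y = np.array([-1.0] * 15 + [1.0] * 15)
m, lam = len(y), 10.0
grams = [rbf_gram(X, g) for g in (0.01, 1.0, 100.0)]   # too wide / matched / too narrow
beta = np.array([y @ K @ y for K in grams]) / (lam * m**2)
print(beta)  # the width matched to the cluster scale typically gets the largest weight
```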

83 We deliberately did not attempt to tune parameters and instead made the following choices uniformly for all four sets: The kernel width was set to a fixed multiple of d, where d is the dimensionality of the data. [sent-182, score-0.435]

84 We deliberately chose too large a value in comparison with the usual rules of thumb [8] to avoid good default kernels. [sent-183, score-0.043]

85 The regularization constant was adjusted so that [target value lost in extraction] (that is, in the Vapnik-style parameterization of SVMs). [sent-184, score-0.029]

86 lambda_h for the Gaussian Harmonic Hyperkernel was chosen uniformly throughout, giving adequate coverage over various kernel widths in (13) (small lambda_h focuses almost exclusively on wide kernels; lambda_h close to 1 will treat all widths equally). [sent-186, score-0.427]

87 Results Despite the fact that we did not try to tune the parameters, we were able to achieve highly competitive results, as shown in Table 2. [sent-189, score-0.029]

88 Using the same non-optimized parameters for different data sets, we achieved results comparable to other recent work on classification, such as boosting, optimized SVMs, and kernel target alignment [10, 11, 7] (note that we use a much smaller part of the data for training). [sent-192, score-0.409]

89 Results based on the regularized quality functional are comparable to hand-tuned SVMs (rightmost column), except for the ionosphere data. [sent-232, score-0.04]

90 We suspect that this is due to the small training sample. [sent-233, score-0.029]

91 Summary and Outlook The regularized quality functional allows the systematic solution of problems associated with the choice of a kernel. [sent-234, score-0.689]

92 Quality criteria that can be used include target alignment, regularized risk and the log posterior. [sent-235, score-0.359]

93 The regularization implicit in our approach allows the control of the overfitting that occurs if one optimizes over too large a class of kernels. [sent-236, score-0.064]

94 A very promising aspect of the current work is that it opens the way to theoretical analyses of the price one pays by optimizing over a larger set of kernels. [sent-237, score-0.067]

95 Current and future research is devoted to working through this analysis and subsequently developing methods for the design of good hyperkernels. [sent-238, score-0.046]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kernel', 0.323), ('hyperkernel', 0.297), ('fx', 0.274), ('quality', 0.262), ('regularized', 0.22), ('dx', 0.199), ('hyperkernels', 0.189), ('reproducing', 0.175), ('functional', 0.165), ('hilbert', 0.161), ('kernels', 0.152), ('functionals', 0.15), ('rkhs', 0.143), ('kk', 0.141), ('representer', 0.141), ('minimizer', 0.14), ('cec', 0.135), ('ec', 0.114), ('risk', 0.113), ('expansion', 0.1), ('ac', 0.099), ('kh', 0.094), ('hyper', 0.086), ('semide', 0.086), ('minimization', 0.084), ('jih', 0.081), ('xh', 0.081), ('harmonic', 0.069), ('optimizing', 0.067), ('norm', 0.066), ('theorem', 0.066), ('regularization', 0.064), ('alignment', 0.06), ('fd', 0.06), ('ug', 0.06), ('kg', 0.06), ('usps', 0.057), ('nition', 0.055), ('ceww', 0.054), ('endowed', 0.054), ('choosing', 0.053), ('av', 0.051), ('optimization', 0.051), ('suitable', 0.05), ('nite', 0.049), ('parameterized', 0.046), ('subsequently', 0.046), ('loss', 0.046), ('whilst', 0.046), ('minimum', 0.044), ('deliberately', 0.043), ('dw', 0.043), ('systematic', 0.042), ('width', 0.04), ('admits', 0.04), ('ionosphere', 0.04), ('widths', 0.04), ('empirical', 0.04), ('boosting', 0.039), ('australian', 0.038), ('pima', 0.038), ('family', 0.036), ('denote', 0.036), ('analogously', 0.036), ('nonempty', 0.036), ('hg', 0.035), ('bousquet', 0.034), ('spans', 0.034), ('frobenius', 0.034), ('convex', 0.034), ('arbitrarily', 0.033), ('coef', 0.032), ('monotonic', 0.032), ('argument', 0.031), ('training', 0.029), ('let', 0.029), ('bc', 0.029), ('parameterization', 0.029), ('tune', 0.029), ('svms', 0.028), ('exists', 0.028), ('series', 0.028), ('ee', 0.027), ('estimators', 0.027), ('target', 0.026), ('xed', 0.026), ('space', 0.026), ('cristianini', 0.025), ('de', 0.025), ('support', 0.025), ('examples', 0.025), ('cients', 0.025), ('processes', 0.025), ('involves', 0.024), ('gaussian', 0.024), ('consulting', 0.024), ('coverage', 0.024), ('algebraically', 0.024), ('iu', 0.024), ('cheng', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 106 nips-2002-Hyperkernels

Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola

Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

2 0.28780755 156 nips-2002-On the Complexity of Learning the Kernel Matrix

Author: Olivier Bousquet, Daniel Herrmann

Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow the spectrum to vary and keep the eigenvectors fixed. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.

3 0.23569286 120 nips-2002-Kernel Design Using Boosting

Author: Koby Crammer, Joseph Keshet, Yoram Singer

Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1 Introduction and problem Setting The last decade brought voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithm based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machines (SVM) [1, 2] designed for classification and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. 
More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X x X -> R. An explicit way to describe K is via a mapping phi : X -> H from X to an inner-product space H such that K(x, x') = phi(x) . phi(x'). Given a kernel operator and a finite set of instances S = {xi, yi}_{i=1}^m, the kernel matrix (a.k.a. the Gram matrix) is the matrix of all possible inner-products of pairs from S, K_{i,j} = K(xi, xj). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K' manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denoted Kj. The goal thereafter is to find a weighted combination of kernels K^(x, x') = sum_j alpha_j Kj(x, x') that is similar, in a sense that will be defined shortly, to the target kernel, K^ ~ K'. Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices, <K, K'>_F = sum_{i,j=1}^m K(xi, xj) K'(xi, xj). Given this definition, they defined the kernel-similarity, or alignment, to be the above inner-product normalized by the norm of each kernel, A(S, K, K') = <K, K'>_F / sqrt(<K, K>_F <K', K'>_F), where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is "flattened" into a vector of dimension m^2.
Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi in {-1, +1} is the label of the instance xi, Cristianini et al. used the outer-product of y as the target kernel, K' = yy^T. Therefore, an optimal alignment is achieved if K(xi, xj) = yi yj. Clearly, if such a kernel is used for classifying instances from X, then the kernel itself suffices to construct an excellent classifier f : X -> {-1, +1} by setting f(x) = sign(yi K(xi, x)) where (xi, yi) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K' on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that sign(K(xi, xj)) is equal to yi yj but whose magnitude, |K(xi, xj)|, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the product of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above. Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi, yi) | xi in X, yi in {-1, +1}, i = 1, ..., m}. We are also given a set of unlabelled examples S~ = {x~i}_{i=1}^m~. If such a set is not provided we can simply use the labelled instances (without the labels themselves) as the set S~. The set S~ is used for constructing the primitive kernels that are combined to constitute the learned kernel K^.
The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned kernel K^. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting is particularly handy in datasets where the test data was collected under a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels. 2 Using Boosting to Combine Kernels Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss), that bound the empirical classification error.

Input: Labelled and unlabelled sets of examples: S = {(xi, yi)}_{i=1}^m ; S~ = {x~i}_{i=1}^m~
Initialize: K <- 0 (all zeros matrix)
For t = 1, 2, ..., T:
- Calculate distribution over pairs 1 <= i, j <= m: Dt(i, j) = exp(-yi yj K(xi, xj)) for ExpLoss, or 1/(1 + exp(-yi yj K(xi, xj))) for LogLoss
- Call base-kernel-learner with (Dt, S, S~) and receive Kt
- Calculate: St+ = {(i, j) | yi yj Kt(xi, xj) > 0} ; St- = {(i, j) | yi yj Kt(xi, xj) < 0} ; Wt+ = sum_{(i,j) in St+} Dt(i, j)|Kt(xi, xj)| ; Wt- = sum_{(i,j) in St-} Dt(i, j)|Kt(xi, xj)|
- Set: alpha_t = (1/2) ln(Wt+ / Wt-) ; K <- K + alpha_t Kt.
Return: kernel operator K : X x X -> R
Figure 1: The skeleton of the boosting algorithm for kernels.
Given the prediction of a classifier f on an instance x and a label y in {-1, +1}, the ExpLoss and the LogLoss are defined as ExpLoss(f(x), y) = exp(-y f(x)) and LogLoss(f(x), y) = log(1 + exp(-y f(x))). Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f(x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strong-hypothesis is of the form f(x) = sum_{t=1}^T alpha_t ht(x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good set of weights alpha_t. Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1, x2)) = y1 y2. Furthermore, we would like to make the magnitude of K(x, x') as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x1, y1) and (x2, y2): ExpLoss(K(x1, x2), y1 y2) = exp(-y1 y2 K(x1, x2)) and LogLoss(K(x1, x2), y1 y2) = log(1 + exp(-y1 y2 K(x1, x2))). Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples, while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances.
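The two pairwise alignment losses above can be sketched directly; a hedged illustration (the function names and example values are ours):

```python
import numpy as np

def exp_loss(k_val, y1, y2):
    # ExpLoss(K(x1,x2), y1*y2) = exp(-y1*y2*K(x1,x2))
    return float(np.exp(-y1 * y2 * k_val))

def log_loss(k_val, y1, y2):
    # LogLoss(K(x1,x2), y1*y2) = log(1 + exp(-y1*y2*K(x1,x2)))
    return float(np.log1p(np.exp(-y1 * y2 * k_val)))

# An aligned pair (same label, positive inner product) is cheap;
# a misaligned pair of the same magnitude is expensive.
print(exp_loss(2.0, +1, +1), exp_loss(2.0, +1, -1))
print(log_loss(2.0, +1, +1), log_loss(2.0, +1, -1))
```

Both losses upper-bound the pairwise misalignment indicator, which is what the booster's distribution update exploits.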
The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification. The pseudo-code of the booster is given in Fig. 1; it is an adaptation to the problem of kernel design of the sequential-update algorithm from [8]. As with other boosting algorithms, the base-learner, which in our case is in charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels. 3 Learning Base Kernels The base kernel learner is provided with a training set S and a distribution Dt over pairs of instances from the training set. It is also provided with a set of unlabelled examples S~. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar product over vectors in R^n. We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = ww^T between a vector w in R^n and itself. Using this definition we get Kw(xi, xj) = (xi . w)(xj . w). Therefore, the similarity between two instances xi and xj is high iff both xi and xj are similar (w.r.t. the standard inner-product) to a third vector w. Analogously, if both xi and xj seem to be dissimilar to the vector w then they are similar to each other. Despite the restrictive form of the inner-products, this family is still too rich for our setting and we further impose two restrictions on the inner products. First, we assume that w is restricted to a linear combination of vectors from S~. Second, since scaling of the base kernels is performed by the booster, we constrain the norm of w to be 1. The resulting class of kernels is therefore C = {Kw = ww^T | w = sum_{r=1}^m~ beta_r x~r, ||w|| = 1}.

Input: A distribution Dt. Labelled and unlabelled sets: S = {(xi, yi)}_{i=1}^m ; S~ = {x~i}_{i=1}^m~.
Compute:
- Calculate: A in R^{m x m~}, A_{i,r} = xi . x~r ; B in R^{m x m}, B_{i,j} = Dt(i, j) yi yj ; K~ in R^{m~ x m~}, K~_{r,s} = x~r . x~s
- Find the generalized eigenvector v in R^m~ for the problem A^T B A v = lambda K~ v which attains the largest eigenvalue lambda
- Set: w = (sum_r vr x~r) / ||sum_r vr x~r||.
Return: Kernel operator Kw = ww^T.
Figure 2: The base kernel learning algorithm.

In the boosting process we need to choose a specific base-kernel Kw from C. We therefore need to devise a notion of how good a candidate for base kernel is given a labelled set S and a distribution function Dt. In this work we use the simplest version suggested by Collins et al. This version can be viewed as a linear approximation on the loss function. We define the score of a kernel Kw w.r.t. the current distribution Dt to be Score(Kw) = sum_{i,j} Dt(i, j) yi yj Kw(xi, xj). (1) The higher the value of the score is, the better Kw fits the training data. Note that if Dt(i, j) = 1/m^2 (as is D0) then Score(Kw) is proportional to the alignment since ||w|| = 1. Under mild assumptions the score can also provide a lower bound on the loss function. To see that, let c be the derivative of the loss function at margin zero, c = Loss'(0). If all the training examples xi in S lie in a ball of radius sqrt(c), we get that Loss(Kw(xi, xj), yi yj) >= 1 - c Kw(xi, xj) yi yj >= 0, and therefore sum_{i,j} Dt(i, j) Loss(Kw(xi, xj), yi yj) >= 1 - c sum_{i,j} Dt(i, j) Kw(xi, xj) yi yj. Using the explicit form of Kw in the Score function (Eq.
(1)), we get Score(Kw) = Σ_{i,j} D(i,j) yi yj (w·xi)(w·xj). Further developing this expression using the constraint w = Σ_{r=1}^{m̃} βr x̃r, we get Score(Kw) = Σ_{r,s} βr βs Σ_{i,j} D(i,j) yi yj (xi·x̃r)(xj·x̃s). To compute the base kernel score efficiently, without an explicit enumeration, we exploit the fact that if the initial distribution D0 is symmetric (D0(i,j) = D0(j,i)), then all the distributions generated along the run of the boosting process, Dt, are also symmetric. We now define a matrix A ∈ R^{m×m̃} with A_{i,r} = xi·x̃r and a symmetric matrix B ∈ R^{m×m} with B_{i,j} = Dt(i,j) yi yj. Simple algebraic manipulations yield that the score function can be written as the quadratic form Score(β) = βᵀ(AᵀBA)β, where β is an m̃-dimensional column vector. Note that since B is symmetric, so is AᵀBA. Finding a good base kernel is thus equivalent to finding a vector β which maximizes this quadratic form under the norm equality constraint ||w||² = βᵀK̃β = 1, where K̃_{r,s} = x̃r·x̃s. Maximizing Score(β) subject to this norm constraint is the well-known generalized eigenvector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AᵀBA is positive semidefinite. Assuming that the matrix K̃ is invertible, the vector β which maximizes the quadratic form is proportional to the eigenvector of K̃⁻¹AᵀBA associated with the largest generalized eigenvalue. Denoting this vector by v, we get w ∝ Σ_{r=1}^{m̃} vr x̃r, and adding the norm constraint, w = (Σ_{r=1}^{m̃} vr x̃r) / ||Σ_{r=1}^{m̃} vr x̃r||. The skeleton of the algorithm for finding a base kernel is given in Fig. 2. To conclude the description of the kernel learning algorithm, we describe how to extend it to be employed with general kernel functions.
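The eigenvector computation above can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the base kernel learner (Fig. 2), not the authors' code; it assumes K̃ is invertible, as the text does.

```python
import numpy as np

def learn_base_kernel(X, X_tilde, D, y):
    """Sketch of the base kernel learner (Fig. 2).

    X: (m, n) labelled instances; X_tilde: (m_t, n) unlabelled templates;
    D: (m, m) symmetric distribution over pairs; y: (m,) labels in {-1, +1}.
    Returns the unit-norm vector w defining the base kernel K_w = w w^T.
    """
    A = X @ X_tilde.T                    # A[i, r] = x_i . x~_r
    B = D * np.outer(y, y)               # B[i, j] = D(i, j) y_i y_j
    K_t = X_tilde @ X_tilde.T            # K~[r, s] = x~_r . x~_s
    # beta maximizing beta^T (A^T B A) beta  s.t.  beta^T K~ beta = 1
    # is the top eigenvector of K~^{-1} A^T B A (K~ assumed invertible).
    M = np.linalg.solve(K_t, A.T @ B @ A)
    eigvals, eigvecs = np.linalg.eig(M)
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    w = X_tilde.T @ v                    # w = sum_r v_r x~_r
    return w / np.linalg.norm(w)         # enforce ||w|| = 1

def score(w, X, D, y):
    """Score(K_w) = sum_{i,j} D(i,j) y_i y_j (w.x_i)(w.x_j), as in Eq. (1)."""
    p = X @ w
    return p @ (D * np.outer(y, y)) @ p
```

By construction, the returned w attains the maximal score among all unit vectors in the span of the templates x̃r.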
Kernelizing the Kernel: As described above, we assumed that the standard scalar-product constitutes the template for the class of base-kernels C. However, since the procedure for choosing a base kernel depends on S and S̃ only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → R, where κ(xi, xj) = φ(xi)·φ(xj). Using a general kernel function κ, however, we cannot compute the vector w explicitly. We therefore need to show that the norm of w, and the evaluation of Kw on any two examples, can still be performed efficiently. First note that, given the vector v, we can compute the norm of w as follows: ||w||² = (Σ_r vr x̃r)ᵀ(Σ_s vs x̃s) = Σ_{r,s} vr vs κ(x̃r, x̃s). Next, given two vectors xi and xj, the value of their inner-product is Kw(xi, xj) = Σ_{r,s} vr vs κ(xi, x̃r) κ(xj, x̃s). Therefore, although we cannot compute the vector w explicitly, we can still compute its norm and evaluate any of the kernels from the class C.

4 Experiments

Synthetic data: We generated binary-labelled data using as input space the vectors in R^100. The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution N(µ, ∆Λ∆⁻¹) with the parameters µ = y (0.03, 0.03)ᵀ, ∆ = (1/√2) [[1, −1], [1, 1]], Λ = diag(0.1, 0.01). That is, the label of each example determined the mean of the distribution from which its first two components were generated. The rest of the components in the vector (98 altogether) were generated independently using a normal distribution with zero mean and a standard deviation of 0.05.

[Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). For each of the two settings we show the first two components of the training data (left) and the matrix of inner-products between the train and the test data (right).]

We generated 100 training and test sets of size 300 and 200, respectively. We used the standard dot-product as the initial kernel operator. In each experiment we first learned a linear classifier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set; after each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than with the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not exposed to the test-set noise, the learned kernel better reflects the structure of the feature space, which makes it more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel.
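The synthetic data above can be reproduced from the quoted parameters. This is a sketch under the assumption that the garbled covariance in the source reads ∆Λ∆⁻¹ with Λ = diag(0.1, 0.01); the function name is ours.

```python
import numpy as np

def make_synthetic(m, rng):
    """Generate m examples: 2 informative components + 98 noise components."""
    y = rng.choice([-1.0, 1.0], size=m)                   # uniform labels
    delta = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
    lam = np.diag([0.1, 0.01])
    cov = delta @ lam @ np.linalg.inv(delta)              # Delta Lambda Delta^-1
    X = np.empty((m, 100))
    # the mean of the informative pair is y * (0.03, 0.03)
    X[:, :2] = rng.multivariate_normal(np.zeros(2), cov, size=m) \
               + 0.03 * y[:, None]
    X[:, 2:] = rng.normal(0.0, 0.05, size=(m, 98))        # noise dimensions
    return X, y
```

Since ∆ is orthogonal, the covariance ∆Λ∆⁻¹ is symmetric positive definite, so the draw is well defined.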
The first and third plots from the left correspond to results obtained using the original kernel, and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel, we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen-direction (x-axis), and shows rather little deviation along the second eigen-direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with a large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure, since both the first and second features are mandatory for correctly classifying the data. The two plots on the right-hand side of Fig. 3 use a gray-level color-map to designate the value of the inner-product between each pair of instances, one from the training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are irrelevant, the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products using the learned kernel form a 2 × 2 block matrix, indicating that the inner-products between instances sharing the same label obtain large positive values.
Similarly, for instances of opposite labels the inner-products are large and negative.

[Figure 4: The online training error (left) and test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set.]

The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on average before converging to a hyperplane which perfectly separates the training data, while the Perceptron algorithm with the learned kernel required a single example to reach a perfect separation on all 100 random training sets.

USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7,291 training examples and 2,007 test examples. Each example is represented as a 16 × 16 matrix where each entry is a pixel that can take values in {0, ..., 255}. Each example is associated with a label in {0, ..., 9}, which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scope of this short paper. We thus concentrate on the binary error results for the 45 binary problems described above.
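The 45 binary problems are simply all unordered pairs of the 10 digit classes, which a one-liner makes explicit:

```python
from itertools import combinations

# One-vs-one split: C(10, 2) = 45 binary problems over the digit classes
digit_pairs = list(combinations(range(10), 2))
print(len(digit_pairs))  # 45
```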
For the original kernel we chose an RBF kernel with σ = 1, which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] with both the original RBF kernel and the learned kernels. One of the motivations for using the Perceptron is its simplicity, which can underscore differences between the kernels. We ran the kernel learning algorithm with LogLoss and ExpLoss, using both the training set and the test set as S̃. Thus, we obtained four different sets of kernels, where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to 30 for the LogLoss and 50 for the ExpLoss when using the training set. When using the test set, the number of rounds of boosting was set to 100 for both losses; since the algorithm exhibits a slower rate of convergence with the test data, we chose a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built using boosting with the LogLoss, and S̃ was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first was built using the training instances as the templates constituting S̃, while the second used the test instances. Although the difference between the two versions is not as significant as the difference in the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances.
[Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel built with training instances. Right: a similar scatter plot for learned kernels constructed from training instances (x-axis) and test instances.]

5 Discussion

In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different from the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded, while the test data often passes through a noisy channel that distorts the signal. An interesting and challenging question that stems from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note that alternative approaches to the kernel design problem have been devised in parallel and independently; see [13, 14] for further details.

Acknowledgements: Special thanks to Cyril Goutte and to John Shawe-Taylor for pointing out the connection to the generalized eigenvector problem. Thanks also to the anonymous reviewers for constructive comments.

References

[1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.
[5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001.
[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000.
[8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
[9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[12] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. Müller, G. Rätsch, and A.J. Smola. Input space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999.
[13] O. Bousquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002.
[14] C.S. Ong, A.J. Smola, and R.C. Williamson. Hyperkernels. NIPS, 2002.

106 nips-2002-Hyperkernels

Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola

Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

120 nips-2002-Kernel Design Using Boosting

Author: Koby Crammer, Joseph Keshet, Yoram Singer

Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigenvector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than with a fixed RBF kernel.

1 Introduction and Problem Setting

The last decade brought a voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithms based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principal component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machine (SVM) [1, 2], designed for classification and regression. A key component in a kernel machine is a kernel operator, which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples of popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet.
More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4]. Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → R. An explicit way to describe K is via a mapping φ : X → H from X to an inner-product space H such that K(x, x') = φ(x)·φ(x'). Given a kernel operator and a finite set of instances S = {(xi, yi)}_{i=1}^{m}, the kernel matrix (a.k.a. the Gram matrix) is the matrix of all possible inner-products of pairs from S, K_{i,j} = K(xi, xj). We therefore refer to the general form of K as the kernel operator, and to the application of the kernel operator to a set of pairs of instances as the kernel matrix. The specific setting of kernel design we consider assumes that we have access to a base kernel learner and are given a target kernel K' manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner, it returns a kernel operator denoted Kj. The goal thereafter is to find a weighted combination of kernels K̂(x, x') = Σ_j αj Kj(x, x') that is similar, in a sense that will be defined shortly, to the target kernel, K̂ ∼ K'. Cristianini et al. [5], in their pioneering work on kernel target alignment, employed as the notion of similarity the inner-product between the kernel matrices, <K, K'>_F = Σ_{i,j=1}^{m} K(xi, xj) K'(xi, xj). Given this definition, they defined the kernel similarity, or alignment, to be the above inner-product normalized by the norm of each kernel, Â(S, K, K') = <K, K'>_F / √(<K, K>_F <K', K'>_F), where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices, where each matrix is “flattened” into a vector of dimension m².
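The alignment formula can be checked with a small NumPy sketch; the example matrices below are hypothetical, chosen only to exercise the definition.

```python
import numpy as np

def alignment(K1, K2):
    """A(S, K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    inner = np.sum(K1 * K2)  # Frobenius inner-product of the matrices
    return inner / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

y = np.array([1.0, 1.0, -1.0])
target = np.outer(y, y)            # ideal kernel: K(x_i, x_j) = y_i y_j
assert alignment(target, target) == 1.0   # cosine of a matrix with itself
```

Because the alignment is a cosine, it is invariant to a global rescaling of either kernel matrix.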
Therefore, this definition implies that the alignment is bounded above by 1, and it attains this value iff the two kernel matrices are identical. Given a (column) vector of m labels y, where yi ∈ {−1, +1} is the label of the instance xi, Cristianini et al. used the outer-product of y as the target kernel, K' = y yᵀ. Therefore, an optimal alignment is achieved if K(xi, xj) = yi yj. Clearly, if such a kernel is used for classifying instances from X, then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting f(x) = sign(yi K(xi, x)), where (xi, yi) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K' on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that sign(K(xi, xj)) is equal to yi yj, but whose magnitude |K(xi, xj)| is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances equals the product of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of these difficulties. Like Cristianini et al., we assume that we are given a set of labelled instances S = {(xi, yi) | xi ∈ X, yi ∈ {−1, +1}, i = 1, ..., m}. We are also given a set of unlabelled examples S̃ = {x̃i}_{i=1}^{m̃}. If such a set is not provided, we can simply use the labelled instances (without the labels themselves) as the set S̃. The set S̃ is used for constructing the primitive kernels that are combined to constitute the learned kernel K̂.
The labelled set is used to form the target kernel matrix, and its instances are used for evaluating the learned kernel K̂. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes particularly handy for datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels.

2 Using Boosting to Combine Kernels

Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss), that bound the empirical classification error.

[Figure 1: The skeleton of the boosting algorithm for kernels.
Input: labelled and unlabelled sets of examples S = {(xi, yi)}_{i=1}^{m}, S̃ = {x̃i}_{i=1}^{m̃}.
Initialize: K ← 0 (all-zeros matrix).
For t = 1, 2, ..., T:
• Calculate the distribution over pairs 1 ≤ i, j ≤ m: Dt(i,j) = exp(−yi yj K(xi, xj)) for ExpLoss, or Dt(i,j) = 1/(1 + exp(yi yj K(xi, xj))) for LogLoss.
• Call the base-kernel-learner with (Dt, S, S̃) and receive Kt.
• Calculate S_t^+ = {(i,j) | yi yj Kt(xi, xj) > 0} and S_t^− = {(i,j) | yi yj Kt(xi, xj) < 0}, with W_t^+ = Σ_{(i,j)∈S_t^+} Dt(i,j) |Kt(xi, xj)| and W_t^− = Σ_{(i,j)∈S_t^−} Dt(i,j) |Kt(xi, xj)|.
• Set αt = (1/2) ln(W_t^+ / W_t^−) and K ← K + αt Kt.
Return: kernel operator K : X × X → R.]
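The loop in Fig. 1 can be sketched as follows. The base-learner interface (a function mapping a pair distribution to a kernel matrix on the sample) is an assumption made for illustration, and the distribution is renormalized here for numerical convenience.

```python
import numpy as np

def boost_kernel(y, base_learner, T=10, loss="exp"):
    """Sketch of Fig. 1: boosting a kernel matrix over pairs of instances.

    y: (m,) labels in {-1, +1}; base_learner(D) -> (m, m) kernel matrix K_t.
    """
    m = len(y)
    K = np.zeros((m, m))
    yy = np.outer(y, y)
    for _ in range(T):
        margins = yy * K                       # pair margins y_i y_j K(x_i, x_j)
        if loss == "exp":
            D = np.exp(-margins)               # ExpLoss weights
        else:
            D = 1.0 / (1.0 + np.exp(margins))  # LogLoss weights
        D = D / D.sum()                        # keep D a distribution over pairs
        K_t = base_learner(D)
        agree = yy * K_t
        W_plus = np.sum(D * np.abs(K_t) * (agree > 0))
        W_minus = np.sum(D * np.abs(K_t) * (agree < 0))
        if W_minus == 0.0:                     # base kernel aligned on every pair
            return K + K_t
        K = K + 0.5 * np.log(W_plus / W_minus) * K_t
    return K
```

With an ideal base-learner that always returns y yᵀ, a single round already aligns every pair and the loop exits early.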
Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1} the ExpLoss and the LogLoss are defined as, ExpLoss(f (x), y) = exp(−yf (x)) LogLoss(f (x), y) = log(1 + exp(−yf (x))) . Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f (x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strongT hypothesis is of the form f (x) = t=1 αt ht (x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good of weights αt . Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. We would like to note however that it is possible to use other variants of boosting to design kernels. In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1 , x2 )) = y1 y2 . Furthermore, we would like to make the magnitude of K(x, x ) to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x 1 , y1 ) and (x2 , y2 ), ExpLoss(K(x1 , x2 ), y1 y2 ) = exp(−y1 y2 K(x1 , x2 )) LogLoss(K(x1 , x2 ), y1 y2 ) = log(1 + exp(−y1 y2 K(x1 , x2 ))) . Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. 
The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification problems. The pseudo-code of the booster is given in Fig. 1; it is an adaptation to the problem of kernel design of the sequential-update algorithm from [8]. As with other boosting algorithms, the base-learner, which in our case is in charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels.

3 Learning Base Kernels

The base kernel learner is provided with a training set S and a distribution D_t over pairs of instances from the training set. It is also provided with a set of unlabelled examples S~. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar product over vectors in R^n. We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products K_w = w w^T between a vector w ∈ R^n and itself. Using this definition we get K_w(x_i, x_j) = (x_i · w)(x_j · w). Therefore, the similarity between two instances x_i and x_j is high iff both x_i and x_j are similar (w.r.t. the standard inner-product) to a third vector w. Analogously, if both x_i and x_j seem to be dissimilar to the vector w then they are similar to each other. Despite the restrictive form of the inner-products, this family is still too rich for our setting and we further impose two restrictions on the inner-products. First, we assume that w is restricted to a linear combination of vectors from S~. Second, since scaling of the base kernels is performed by the booster, we constrain the norm of w to be 1. The resulting class of kernels is therefore C = {K_w = w w^T | w = Σ_{r=1}^{m~} β_r x~_r, ||w|| = 1}.

Input: A distribution D_t. Labelled and unlabelled sets: S = {(x_i, y_i)}_{i=1}^m; S~ = {x~_i}_{i=1}^{m~}.
Compute:
• Calculate: A ∈ R^{m×m~}, A_{i,r} = x_i · x~_r; B ∈ R^{m×m}, B_{i,j} = D_t(i, j) y_i y_j; K~ ∈ R^{m~×m~}, K~_{r,s} = x~_r · x~_s.
• Find the generalized eigenvector v ∈ R^{m~} for the problem A^T B A v = λ K~ v which attains the largest eigenvalue λ.
• Set: w = (Σ_r v_r x~_r) / ||Σ_r v_r x~_r||.
Return: Kernel operator K_w = w w^T.

Figure 2: The base kernel learning algorithm.

In the boosting process we need to choose a specific base kernel K_w from C. We therefore need to devise a notion of how good a candidate base kernel is, given a labelled set S and a distribution function D_t. In this work we use the simplest version suggested by Collins et al., which can be viewed as a linear approximation of the loss function. We define the score of a kernel K_w w.r.t. the current distribution D_t to be

Score(K_w) = Σ_{i,j} D_t(i, j) y_i y_j K_w(x_i, x_j) .   (1)

The higher the value of the score, the better K_w fits the training data. Note that if D_t(i, j) = 1/m² (as is D_0) then Score(K_w) is proportional to the alignment, since ||w|| = 1. Under mild assumptions the score can also provide a lower bound on the loss function. To see that, let c be the derivative of the loss function at margin zero, c = Loss′(0). If all the training examples x_i ∈ S lie in a ball of radius 1/√c, we get that Loss(K_w(x_i, x_j), y_i y_j) ≥ 1 − c K_w(x_i, x_j) y_i y_j ≥ 0, and therefore Σ_{i,j} D_t(i, j) Loss(K_w(x_i, x_j), y_i y_j) ≥ 1 − c Σ_{i,j} D_t(i, j) K_w(x_i, x_j) y_i y_j. Using the explicit form of K_w in the Score function (Eq.
(1)) we get, Score(Kw ) = i,j D(i, j)yi yj (w·xi )(w·xj ) . Further developing the above equation using the constraint that w = m ˜ r=1 βr xr we get, ˜ Score(Kw ) = βs βr r,s i,j D(i, j)yi yj (xi · xr ) (xj · xs ) . ˜ ˜ To compute efficiently the base kernel score without an explicit enumeration we exploit the fact that if the initial distribution D0 is symmetric (D0 (i, j) = D0 (j, i)) then all the distributions generated along the run of the boosting process, D t , are also symmetric. We ˜ now define a matrix A ∈ m×m where Ai,r = xi · xr and a symmetric matrix B ∈ m×m ˜ with Bi,j = Dt (i, j)yi yj . Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β T (AT BA)β , where β is m dimensional column vector. Note that since B is symmetric so is A T BA. Finding a ˜ good base kernel is equivalent to finding a vector β which maximizes this quadratic form 2 m ˜ under the norm equality constraint w = ˜ 2 = β T Kβ = 1 where Kr,s = r=1 βr xr xr · xs . Finding the maximum of Score(β) subject to the norm constraint is a well known ˜ ˜ maximization problem known as the generalized eigen vector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix AT BA is positive semidefinite. Assuming that the matrix K is invertible, the the vector β which maximizes the quadratic form is proportional the eigenvector of K −1 AT BA which is associated with the m ˜ generalized largest eigenvalue. Denoting this vector by v we get that w ∝ ˜ r=1 vr xr . m ˜ m ˜ Adding the norm constraint we get that w = ( r=1 vr xr )/ ˜ vr xr . The skeleton ˜ r=1 of the algorithm for finding a base kernels is given in Fig. 3. To conclude the description of the kernel learning algorithm we describe how to the extend the algorithm to be employed with general kernel functions.     
Kernelizing the Kernel: As described above, we assumed that the standard scalar-product constitutes the template for the class of base kernels C. However, since the procedure for choosing a base kernel depends on S and S~ only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → R, where κ(x_i, x_j) = φ(x_i) · φ(x_j). Using a general kernel function κ we cannot, however, compute the vector w explicitly. We therefore need to show that the norm of w, and the evaluation of K_w on any two examples, can still be performed efficiently. First note that given the vector v we can compute the norm of w as follows:

||w||² = (Σ_r v_r x~_r)^T (Σ_s v_s x~_s) = Σ_{r,s} v_r v_s κ(x~_r, x~_s) .

Next, given two vectors x_i and x_j, the value of their inner-product is

K_w(x_i, x_j) = Σ_{r,s} v_r v_s κ(x_i, x~_r) κ(x_j, x~_s) .

Therefore, although we cannot compute the vector w explicitly, we can still compute its norm and evaluate any of the kernels from the class C.

4 Experiments

Synthetic data: We generated binary-labelled data using as input space the vectors in R^100. The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution N(µ, ∆Σ∆^{−1}) with the following parameters:

µ = y (0.03, 0.03)^T ,  ∆ = (1/√2) [[1, −1], [1, 1]] ,  Σ = [[0.1, 0], [0, 0.01]] .

That is, the label of each example determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98 altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200 respectively. We used the standard dot-product as the initial kernel operator. In each experiment we first learned a linear classifier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than with the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not exposed to the test-set noise, the learned kernel better reflects the structure of the feature space, which makes it more robust to noise. Fig. 3 further illustrates the benefits of using a boutique kernel.

[Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). For each of the two settings we show the first two components of the training data (left) and the matrix of inner products between the train and the test data (right).]
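The synthetic data described above can be generated as follows. The covariance construction ∆Σ∆^{−1} is our reconstruction of the parameter block in the text, so treat it as an assumption rather than the authors' exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 100

def make_example(rng):
    """One synthetic example: 2 informative components plus 98 noise ones."""
    y = int(rng.choice([-1, 1]))                      # label, uniform
    mu = y * np.array([0.03, 0.03])                   # label-dependent mean
    Delta = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
    Sigma = np.diag([0.1, 0.01])
    cov = Delta @ Sigma @ np.linalg.inv(Delta)        # label-independent covariance
    x = np.empty(n_features)
    x[:2] = rng.multivariate_normal(mu, cov)          # two informative features
    x[2:] = rng.normal(0.0, 0.05, n_features - 2)     # 98 noise features
    return x, y

X, y = zip(*(make_example(rng) for _ in range(300)))  # one training set of size 300
```

Since ∆ is a rotation, ∆^{−1} = ∆^T and the resulting covariance is symmetric, as `multivariate_normal` requires.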
The first and third plots from the left correspond to results obtained using the original kernel, and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen-direction (x-axis) and shows rather little deviation along the second eigen-direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with a large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure, since both the first and second features are mandatory for correctly classifying the data. The two plots on the right-hand side of Fig. 3 use a gray-level color-map to designate the value of the inner-product between each pair of instances, one from the training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are irrelevant, the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products matrix obtained with the learned kernel exhibits a 2 × 2 block structure, indicating that the inner-products between instances sharing the same label attain large positive values.
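The eigenvector projection described above, taking the coordinates of each instance along the two leading eigenvectors of the learned Gram matrix, can be sketched as follows. The exact projection convention (e.g., whether coordinates are rescaled by eigenvalues) is not stated in the text, so the details here are an assumption:

```python
import numpy as np

def top2_projection(K):
    """Project the instances underlying a Gram matrix K onto its two
    eigenvectors with the largest eigenvalues (assumed convention)."""
    vals, vecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    top = vecs[:, -2:][:, ::-1]           # two leading eigenvectors, descending
    return K @ top                        # coordinates along each direction

# Example with a synthetic Gram matrix:
X = np.random.default_rng(1).normal(size=(50, 5))
Z = top2_projection(X @ X.T)              # 2-D coordinates for 50 instances
```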
[Figure 4: The online training error (left) and test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set.]

Similarly, for instances of opposite labels the inner-products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on average before converging to a hyperplane which perfectly separates the training data, while the Perceptron algorithm with the learned kernel required a single example to reach a perfect separation on all 100 random training sets.

USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7,291 training examples and 2,007 test examples. Each example is represented as a 16 × 16 matrix where each entry is a pixel that can take values in {0, . . . , 255}. Each example is associated with a label in {0, . . . , 9}, which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. The interesting question of how to learn kernels for multiclass problems is beyond the scope of this short paper. We thus concentrate on the binary error results for the 45 binary problems described above.
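The Perceptron used throughout the experiments can be run directly from a precomputed Gram matrix. A standard kernelized formulation is sketched below; the paper's exact variant may differ:

```python
import numpy as np

def kernel_perceptron(K, y, epochs=10):
    """Train a kernel Perceptron given a precomputed Gram matrix K (m x m)
    and labels y in {-1, +1}; returns the mistake counts alpha."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            # f(x_i) = sum_j alpha_j y_j K(x_j, x_i); update on a mistake
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:
                alpha[i] += 1.0
    return alpha

def predict(K_test, alpha, y_train):
    """Predict labels given kernel values between test and train instances."""
    return np.sign(K_test @ (alpha * y_train))
```

With the learned kernel, K would simply be the matrix of K_w evaluations between the instances.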
For the original kernel we chose an RBF kernel with σ = 1, which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity, which can underscore differences between the kernels. We ran the kernel learning algorithm with LogLoss and ExpLoss, using both the training set and the test set as S~. Thus, we obtained four different sets of kernels, where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to 30 for the LogLoss and 50 for the ExpLoss when using the training set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits a slower rate of convergence with the test data, we chose a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built using boosting with the LogLoss, and S~ was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first was built using the training instances as the templates constituting S~, while the second used the test instances. Although the difference between the two versions is not as significant as the difference in the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances.
[Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel built from training instances. Right: a similar scatter plot for a learned kernel constructed from training instances (x-axis) and from test instances.]

5 Discussion

In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test-data distribution is different from the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challenging question that stems from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note that alternative approaches to the kernel design problem have been devised in parallel and independently; see [13, 14] for further details.

Acknowledgements: Special thanks to Cyril Goutte and to John Shawe-Taylor for pointing out the connection to the generalized eigenvector problem. Thanks also to the anonymous reviewers for constructive comments.

References

[1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.

[4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.
[5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001.

[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002.

[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000.

[8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.

[9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.

[10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.

[12] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K. Müller, G. Rätsch, and A. J. Smola. Input space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999.

[13] O. Bousquet and D. J. L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002.

[14] C. S. Ong, A. J. Smola, and R. C. Williamson. Superkernels. NIPS, 2002.

4 0.73784107 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.

5 0.72843015 119 nips-2002-Kernel Dependency Estimation

Author: Jason Weston, Olivier Chapelle, Vladimir Vapnik, André Elisseeff, Bernhard Schölkopf

Abstract: We consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be for example: vectors, images, strings, trees or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We experimentally validate our approach on several tasks: mapping strings to strings, pattern recognition, and reconstruction from partial images. 1

6 0.72023743 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

7 0.71810454 113 nips-2002-Information Diffusion Kernels

8 0.68990171 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

9 0.68271166 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

10 0.61509609 125 nips-2002-Learning Semantic Similarity

11 0.60313463 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

12 0.60056496 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

13 0.59674132 167 nips-2002-Rational Kernels

14 0.53544891 124 nips-2002-Learning Graphical Models with Mercer Kernels

15 0.45759231 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

16 0.447496 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

17 0.41111949 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

18 0.41035184 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression

19 0.39491385 178 nips-2002-Robust Novelty Detection with Single-Class MPM

20 0.38214293 45 nips-2002-Boosted Dyadic Kernel Discriminants


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.025), (23, 0.014), (42, 0.063), (47, 0.014), (54, 0.189), (55, 0.079), (57, 0.023), (64, 0.01), (67, 0.013), (68, 0.021), (71, 0.193), (74, 0.079), (87, 0.014), (92, 0.059), (98, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98372203 30 nips-2002-Annealing and the Rate Distortion Problem

Author: Albert E. Parker, Tomá\v S. Gedeon, Alexander G. Dimitrov

Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1

2 0.9044413 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

Author: Christopher G. Atkeson, Jun Morimoto

Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.

same-paper 3 0.88601595 106 nips-2002-Hyperkernels

Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola

Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.

4 0.78620744 119 nips-2002-Kernel Dependency Estimation

Author: Jason Weston, Olivier Chapelle, Vladimir Vapnik, André Elisseeff, Bernhard Schölkopf

Abstract: We consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be for example: vectors, images, strings, trees or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We experimentally validate our approach on several tasks: mapping strings to strings, pattern recognition, and reconstruction from partial images. 1

5 0.78301734 10 nips-2002-A Model for Learning Variance Components of Natural Images

Author: Yan Karklin, Michael S. Lewicki

Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.

6 0.78150588 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

7 0.77845436 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

8 0.77812564 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

9 0.77794909 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking

10 0.77680707 53 nips-2002-Clustering with the Fisher Score

11 0.77557367 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

12 0.77470469 27 nips-2002-An Impossibility Theorem for Clustering

13 0.77372724 156 nips-2002-On the Complexity of Learning the Kernel Matrix

14 0.7729308 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

15 0.77285033 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

16 0.77178228 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

17 0.77118284 190 nips-2002-Stochastic Neighbor Embedding

18 0.76939923 140 nips-2002-Margin Analysis of the LVQ Algorithm

19 0.76874763 2 nips-2002-A Bilinear Model for Sparse Coding

20 0.76757497 21 nips-2002-Adaptive Classification by Variational Kalman Filtering