nips nips2000 nips2000-134 knowledge-graph by maker-knowledge-mining

134 nips-2000-The Kernel Trick for Distances


Source: pdf

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. [sent-4, score-0.712]

2 This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. [sent-5, score-0.407]

3 It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. [sent-6, score-0.879]

4 As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms. [sent-7, score-0.038]

5 1 Introduction One of the crucial ingredients of SVMs is the so-called kernel trick for the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns. [sent-8, score-0.854]

6 This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. [sent-9, score-0.366]

7 Although the mathematical result underlying the kernel trick is almost a century old [6], it was only much later [1, 3,13] that it was made fruitful for the machine learning community. [sent-10, score-0.522]

8 Kernel methods have since led to interesting generalizations of learning algorithms and to successful real-world applications. [sent-11, score-0.097]

9 The present paper attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. [sent-12, score-0.991]

10 Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15]. [sent-13, score-0.387]

11 We are interested in predicting the outputs y for previously unseen patterns x. [sent-21, score-0.054]

12 This is only possible if we have some measure that tells us how (x, y) is related to the training examples. [sent-22, score-0.054]

13 On the outputs, similarity is usually measured in terms of a loss function. [sent-25, score-0.065]

14 On the inputs, the notion of similarity is more complex. [sent-27, score-0.065]

15 It hinges on a representation of the patterns and a suitable similarity measure operating on that representation. [sent-28, score-0.147]

16 One particularly simple yet surprisingly useful notion of (dis)similarity - the one we will use in this paper - derives from embedding the data into a Euclidean space and utilizing geometrical concepts. [sent-29, score-0.051]

17 For instance, in SVMs, similarity is measured by dot products (i.e. [sent-30, score-0.246]

18 angles and lengths) in some high-dimensional feature space F. [sent-32, score-0.146]

19 Formally, the patterns are first mapped into F using φ : X → F, x ↦ φ(x), and then compared using a dot product ⟨φ(x), φ(x')⟩. [sent-33, score-0.232]

20 To avoid working in the potentially high-dimensional space F, one tries to pick a feature space in which the dot product can be evaluated directly using a nonlinear function in input space, i.e. [sent-34, score-0.451]

21 by means of the kernel trick k(x, x') = ⟨φ(x), φ(x')⟩. (1) [sent-36, score-0.522]

22 Often, one simply chooses a kernel k with the property that there exists some φ such that the above holds true, without necessarily worrying about the actual form of φ - already the existence of the linear space F facilitates a number of algorithmic and theoretical issues. [sent-37, score-0.443]

23 It is well established that (1) works out for Mercer kernels [3, 13], or, equivalently, positive definite kernels [2, 14]. [sent-38, score-0.956]

24 Definition 1 (Positive definite kernel) A symmetric function k : X × X → ℝ which for all m ∈ ℕ, xᵢ ∈ X gives rise to a positive definite Gram matrix, i.e. [sent-43, score-0.66]

25 one for which Σᵢ,ⱼ cᵢcⱼKᵢⱼ ≥ 0 for all cᵢ ∈ ℝ, where Kᵢⱼ := k(xᵢ, xⱼ) (2), is called a positive definite (pd) kernel. [sent-49, score-0.353]

26 One particularly intuitive way to construct a feature map satisfying (1) for such a kernel k proceeds, in a nutshell, as follows (for details, see [2]): 1. [sent-50, score-0.584]

27 Define a feature map φ : X → ℝ^X, x ↦ k(·, x). (3) [sent-51, score-0.143]

28 Here, ℝ^X denotes the space of functions mapping X into ℝ. 2. [sent-53, score-0.051]

29 Turn it into a linear space by forming linear combinations f(·) = Σᵢ αᵢ k(·, xᵢ), g(·) = Σⱼ βⱼ k(·, x'ⱼ) (i = 1, …, m; j = 1, …, m'). (4) 3. [sent-54, score-0.051]

30 Endow it with a dot product ⟨f, g⟩ := Σᵢ Σⱼ αᵢβⱼ k(xᵢ, x'ⱼ), and turn it into a Hilbert space Hₖ by completing it in the corresponding norm. [sent-55, score-0.28]

31 Note that in particular, by definition of the dot product, ⟨k(·, x), k(·, x')⟩ [sent-56, score-0.201]

32 = k(x, x'); hence, in view of (3), we have k(x, x') = ⟨φ(x), φ(x')⟩, the kernel trick. [sent-58, score-0.361]

33 This shows that pd kernels can be thought of as (nonlinear) generalizations of one of the simplest similarity measures, the canonical dot product ⟨x, x'⟩, x, x' ∈ ℝᴺ. [sent-59, score-0.903]
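
The construction in steps 1-3 is easy to mirror numerically. The sketch below is a minimal illustration, not code from the paper: the Gaussian kernel, the expansion points, and all function names are assumptions made for the example. It represents functions of the form (4) by their coefficients and expansion points, evaluates the dot product of step 3, and checks the reproducing property that gives the kernel trick.

```python
import numpy as np

def k(x, xp, gamma=1.0):
    # A pd kernel (Gaussian), standing in for the generic kernel of Definition 1.
    return np.exp(-gamma * np.sum((np.asarray(x, float) - np.asarray(xp, float)) ** 2))

# Steps 1-2: represent f(.) = sum_i alpha_i k(., x_i) and g(.) = sum_j beta_j k(., x'_j)
# by their coefficients and expansion points.
X_f, alpha = np.array([[0.0], [1.0]]), np.array([0.5, -0.2])
X_g, beta = np.array([[0.5], [2.0]]), np.array([1.0, 0.3])

def evaluate(coeffs, points, x):
    # f(x) = sum_i coeffs_i k(x_i, x)
    return sum(c * k(p, x) for c, p in zip(coeffs, points))

def dot(cf, Pf, cg, Pg):
    # Step 3: <f, g> = sum_i sum_j alpha_i beta_j k(x_i, x'_j)
    return sum(a * b * k(p, q) for a, p in zip(cf, Pf) for b, q in zip(cg, Pg))

print(dot(alpha, X_f, beta, X_g))   # <f, g> in the Hilbert space H_k

# Reproducing property: <k(., x), f> = f(x), hence k(x, x') = <phi(x), phi(x')>.
x = np.array([0.3])
assert np.isclose(dot([1.0], [x], alpha, X_f), evaluate(alpha, X_f, x))
```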

34 The question arises as to whether there also exist generalizations of the simplest dissimilarity measure, the distance ‖x - x'‖². [sent-60, score-0.145]

35 Clearly, the distance ‖φ(x) - φ(x')‖² in the feature space associated with a pd kernel k can be computed using the kernel trick (1) as k(x, x) + k(x', x') - 2k(x, x'). [sent-61, score-1.343]
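
As a quick illustration of this identity, the sketch below evaluates the feature-space distance directly from a kernel; the Gaussian kernel and all names are illustrative choices, not part of the paper.

```python
import numpy as np

def gauss_kernel(x, xp, gamma=1.0):
    # A pd kernel; any kernel satisfying (1) could be substituted here.
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def feature_dist_sq(k, x, xp):
    # ||phi(x) - phi(x')||^2 = k(x, x) + k(x', x') - 2 k(x, x')
    return k(x, x) + k(xp, xp) - 2.0 * k(x, xp)

x, xp = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(feature_dist_sq(gauss_kernel, x, xp))
```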

36 Positive definite kernels are, however, not the full story: there exists a larger class of kernels that can be used as generalized distances, and the following section will describe why. [sent-62, score-0.866]

37 2 Kernels as Generalized Distance Measures Let us start by considering how a dot product and the corresponding distance measure are affected by a translation of the data, x ↦ x - x₀. [sent-63, score-0.412]

38 Clearly, ‖x - x'‖² is translation invariant while ⟨x, x'⟩ is not. [sent-64, score-0.129]

39 A short calculation shows that the effect of the translation can be expressed in terms of ‖x - x'‖²: ⟨(x - x₀), (x' - x₀)⟩ = ½ (-‖x - x'‖² + ‖x - x₀‖² + ‖x' - x₀‖²). (5) [sent-65, score-0.115]

40 Note that this is, just like ⟨x, x'⟩, still a pd kernel: Σᵢ,ⱼ cᵢcⱼ⟨(xᵢ - x₀), (xⱼ - x₀)⟩ = ‖Σᵢ cᵢ(xᵢ - x₀)‖² ≥ 0. [sent-68, score-0.253]

41 For any choice of x₀ ∈ X, we thus get a similarity measure (5) associated with the dissimilarity measure ‖x - x'‖. [sent-69, score-0.204]

42 This naturally leads to the question whether (5) might suggest a connection that holds true also in more general cases: what kind of nonlinear dissimilarity measure do we have to substitute instead of ‖·‖² [sent-70, score-0.17]

43 on the right-hand side of (5) to ensure that the left-hand side becomes positive definite? [sent-72, score-0.129]

44 Definition 2 (Conditionally positive definite kernel) A symmetric function k : X × X → ℝ which satisfies (2) for all m ∈ ℕ, xᵢ ∈ X and for all cᵢ ∈ ℝ with Σᵢ cᵢ = 0 [sent-75, score-0.436]

45 (6) is called a conditionally positive definite (cpd) kernel. [sent-78, score-0.515]

46 Proposition 3 (Connection pd - cpd [2]) Let x₀ ∈ X, and let k be a symmetric kernel on X × X. [sent-79, score-1.144]

47 Then k̃(x, x') := ½ (k(x, x') - k(x, x₀) - k(x₀, x') + k(x₀, x₀)) (7) is positive definite if and only if k is conditionally positive definite. [sent-80, score-0.644]

48 The proof follows directly from the definitions and can be found in [2]. [sent-81, score-0.043]
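
Proposition 3 can be exercised numerically. The sketch below is illustrative only (the kernel k_cpd, the helper pd_from_cpd, and the random data are assumptions of the example): it builds k̃ from the cpd kernel -‖x - x'‖ via (7) and checks that the resulting Gram matrix is positive semidefinite.

```python
import numpy as np

def k_cpd(x, xp):
    # A cpd (not pd) kernel: the negative distance -||x - x'||, i.e. (8) with beta = 1.
    return -np.linalg.norm(x - xp)

def pd_from_cpd(k, x0):
    # Eq. (7): k_tilde(x, x') = 1/2 (k(x, x') - k(x, x0) - k(x0, x') + k(x0, x0))
    def k_tilde(x, xp):
        return 0.5 * (k(x, xp) - k(x, x0) - k(x0, xp) + k(x0, x0))
    return k_tilde

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
k_tilde = pd_from_cpd(k_cpd, x0=X[0])
G = np.array([[k_tilde(x, xp) for xp in X] for x in X])
print(np.linalg.eigvalsh(G).min() >= -1e-9)   # Gram matrix of k_tilde is positive semidefinite
```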

49 This result does generalize (5): the negative squared distance kernel is indeed cpd, for Σᵢ cᵢ = 0 implies -Σᵢ,ⱼ cᵢcⱼ‖xᵢ - xⱼ‖² = -Σᵢ cᵢ Σⱼ cⱼ‖xⱼ‖² - Σⱼ cⱼ Σᵢ cᵢ‖xᵢ‖² + 2Σᵢ,ⱼ cᵢcⱼ⟨xᵢ, xⱼ⟩ = 2Σᵢ,ⱼ cᵢcⱼ⟨xᵢ, xⱼ⟩ = 2‖Σᵢ cᵢxᵢ‖² ≥ 0. [sent-82, score-0.504]

50 In fact, this implies that all kernels of the form k(x, x') = -‖x - x'‖^β, 0 < β ≤ 2 (8) are cpd (they are not pd), by application of the following result: Proposition 4 ([2]) If k : X × X → (-∞, 0] is cpd, then so are -(-k)^α (0 < α < 1) and -log(1 - k). [sent-83, score-0.723]
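
A small numerical check of Definition 2 for the family (8) is sketched below; it is an illustration on random data, not a proof, and the function names are invented for the example. Coefficient vectors are projected onto Σᵢ cᵢ = 0 before evaluating the quadratic form.

```python
import numpy as np

def k_beta(x, xp, beta):
    # The family (8): k(x, x') = -||x - x'||^beta, 0 < beta <= 2.
    return -np.linalg.norm(x - xp) ** beta

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
for beta in (0.5, 1.0, 2.0):
    K = np.array([[k_beta(x, xp, beta) for xp in X] for x in X])
    for _ in range(200):
        c = rng.normal(size=len(X))
        c -= c.mean()                 # enforce sum_i c_i = 0, as in Definition 2
        assert c @ K @ c >= -1e-8     # the quadratic form is nonnegative on that subspace
    print(beta, "passed")
```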

51 To state another class of cpd kernels that are not pd, note first that, as trivial consequences of Definition 2, we know that (i) sums of cpd kernels are cpd, and (ii) any constant b ∈ ℝ is cpd. [sent-84, score-1.431]

52 Therefore, any kernel of the form k + b, where k is cpd and b ∈ ℝ, is also cpd. [sent-85, score-0.771]

53 In particular, since pd kernels are cpd, we can take any pd kernel and offset it by b and it will still be at least cpd. [sent-86, score-1.19]

54 Proposition 3 allows us to construct the feature map for k from that of the pd kernel k̃. [sent-90, score-0.783]

55 Therefore, we may employ the Hilbert space representation φ : X → H of k̃ (cf. [sent-93, score-0.078]

56 (1)), satisfying ⟨φ(x), φ(x')⟩ = k̃(x, x'), hence ‖φ(x) - φ(x')‖² = ⟨φ(x) - φ(x'), φ(x) - φ(x')⟩ = k̃(x, x) + k̃(x', x') - 2k̃(x, x'). [sent-94, score-0.054]

57 Proposition 5 (Hilbert space representation of cpd kernels [7, 2]) Let k be a real-valued conditionally positive definite kernel on X, satisfying k(x, x) = 0 for all x ∈ X. [sent-97, score-1.746]

58 Then there exists a Hilbert space H of real-valued functions on X, and a mapping φ : X → H, such that ‖φ(x) - φ(x')‖² = -k(x, x'). (11) [sent-98, score-0.082]

59 In the general case (without requiring k(x, x) = 0), the Hilbert space representation reads ‖φ(x) - φ(x')‖² = -k(x, x') + ½ (k(x, x) + k(x', x')). (12) [sent-100, score-0.078]
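
A minimal sketch of (11) and (12): the generalized squared distance is computed directly from a cpd kernel; the kernel choice and the names are illustrative assumptions.

```python
import numpy as np

def cpd_dist_sq(k, x, xp):
    # Eq. (12): ||phi(x) - phi(x')||^2 = -k(x, x') + 1/2 (k(x, x) + k(x', x'))
    return -k(x, xp) + 0.5 * (k(x, x) + k(xp, xp))

k = lambda x, xp: -np.linalg.norm(x - xp) ** 1.5   # a cpd kernel from (8); here k(x, x) = 0
x, xp = np.array([0.0, 1.0]), np.array([2.0, 0.0])
print(cpd_dist_sq(k, x, xp))                        # reduces to -k(x, x'), cf. (11)
```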

60 We next show how to represent general symmetric kernels (thus in particular cpd kernels) as symmetric bilinear forms Q in feature spaces. [sent-104, score-1.025]

61 This generalization of the previously known feature space representation for pd kernels comes at a cost: Q will no longer be a dot product. [sent-105, score-0.871]

62 The result will give us an intuitive understanding of Proposition 3: we can then write k̃ as k̃(x, x') := Q(φ(x) - φ(x₀), φ(x') - φ(x₀)). [sent-107, score-0.052]

63 Proposition 3 thus essentially adds an origin in feature space which corresponds to the image φ(x₀) of one point x₀ under the feature map. [sent-108, score-0.291]

64 For translation invariant algorithms, we are always allowed to do this, and thus turn a cpd kernel into a pd one - in this sense, cpd kernels are "as good as" pd kernels. [sent-109, score-2.129]

65 Proposition 6 (Vector space representation of symmetric kernels) Let k be a real-valued symmetric kernel on X. [sent-110, score-0.644]

66 Then there exists a linear space H of real-valued functions on X, endowed with a symmetric bilinear form Q(·, ·), and a mapping φ : X → H, such that k(x, x') = Q(φ(x), φ(x')). (13) [sent-112, score-0.199]

67 Proof: The proof is a direct modification of the pd case. [sent-115, score-0.296]

68 We use the map (3) and linearly complete the image as in (4). [sent-116, score-0.048]

69 • Note, moreover, that by definition of Q, k is a reproducing kernel for the feature space (which is not a Hilbert space): for all functions f of the form (4), we have Q(k(·, x), f) = f(x). [sent-121, score-0.552]

70 Proposition 7 (cf. [2], p. 23) Let K be a symmetric matrix, e the m-dimensional vector of all ones, I the m × m identity matrix, and let c ∈ ℂᵐ satisfy e*c = 1. [sent-132, score-0.146]

71 Then K̃ := (I - ec*) K (I - ce*) (14) is positive definite if and only if K is conditionally positive definite. [sent-133, score-0.644]

72 0 ≤ a*Ka, proving that K is conditionally positive definite. [sent-140, score-0.054]

73 The map (I - ce*) has its range in the orthogonal complement of e, which can be seen by computing, for any a ∈ ℂᵐ, e*(I - ce*)a = e*a - e*c e*a = 0. (16) [sent-142, score-0.129]

74 Moreover, being symmetric and satisfying (I - ce*)² = (I - ce*), the map (I - ce*) is a projection. [sent-143, score-0.185]

75 Thus K̃ is the restriction of K to the orthogonal complement of e, and by definition of conditional positive definiteness, that is precisely the space where K is positive definite. [sent-144, score-0.39]
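
The matrix version (14) is straightforward to apply in practice. The sketch below is an illustration, not the paper's code; the function name and the default choice c = e/m (which corresponds to the familiar kernel-PCA centering) are assumptions made here. It verifies numerically that centering the Gram matrix of the cpd kernel (8) yields a positive semidefinite matrix.

```python
import numpy as np

def center(K, c=None):
    # Eq. (14): K_tilde = (I - e c*) K (I - c e*), with e the all-ones vector and e*c = 1.
    m = K.shape[0]
    e = np.ones((m, 1))
    c = e / m if c is None else np.asarray(c, float).reshape(m, 1)
    return (np.eye(m) - e @ c.T) @ K @ (np.eye(m) - c @ e.T)

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
K = -np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # Gram matrix of the cpd kernel (8), beta = 1
print(np.linalg.eigvalsh(center(K)).min() >= -1e-9)           # the centered matrix is positive semidefinite
```

Other admissible choices of c single out other points, or weighted combinations of points, as the origin, which is the content of the generalization stated next.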

76 • This result directly implies a corresponding generalization of Proposition 3: Proposition 8 (Adding a general origin) Let k be a symmetric kernel, and let c₁, …, cₘ ∈ ℝ satisfy Σᵢ cᵢ = 1. [sent-145, score-0.144]

77 Then k̃(x, x') := ½ (k(x, x') - Σᵢ cᵢ k(x, xᵢ) - Σⱼ cⱼ k(xⱼ, x') + Σᵢ,ⱼ cᵢcⱼ k(xᵢ, xⱼ)), for x₁, …, xₘ ∈ X, (17) is positive definite if and only if k is conditionally positive definite. [sent-149, score-0.644]

78 Seen in this light, it is not surprising that the structure of the dual optimization problem (cf. [13]) allows cpd kernels: as noticed in [11, 10], the constraint Σᵢ αᵢyᵢ = 0 projects out the same subspace [sent-167, score-0.41]

79 as (6) in the definition of conditionally positive definite kernels. [sent-168, score-0.129]

80 (ii) Another example of a kernel algorithm that works with conditionally positive definite kernels is kernel PCA [9], where the data is centered, thus removing the dependence on the origin in feature space. [sent-169, score-0.601]

81 Example 10 (Parzen windows) One of the simplest distance-based classification algorithms conceivable proceeds as follows. [sent-172, score-0.106]

82 Note that for some cpd kernels, such as (8), k(xᵢ, xᵢ) is always 0, thus c = 0. [sent-174, score-0.41]

83 For normalized Gaussians and other kernels that are valid density models, the resulting decision boundary can be interpreted as the Bayes decision based on two Parzen windows density estimates of the classes; for general cpd kernels, the analogy is a mere formal one. [sent-176, score-0.767]
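
The extracted summary omits the explicit decision rule of Example 10, so the sketch below assumes the simplest reading: assign a test point to the class whose feature-space mean is closer, with all distances written in terms of the kernel. The data, names, and kernel choice are illustrative; for kernels such as (8), k(x, x) = 0, consistent with the remark above.

```python
import numpy as np

def mean_dist_sq(k, x, Xc):
    # Squared feature-space distance between phi(x) and the mean of {phi(x_i)}, expressed
    # purely through the kernel (via the bilinear form Q of Proposition 6):
    # k(x, x) - (2/m) sum_i k(x, x_i) + (1/m^2) sum_ij k(x_i, x_j)
    m = len(Xc)
    return (k(x, x)
            - 2.0 / m * sum(k(x, xi) for xi in Xc)
            + 1.0 / m ** 2 * sum(k(xi, xj) for xi in Xc for xj in Xc))

def classify(k, x, X_plus, X_minus):
    # Assign x to the class whose feature-space mean is closer.
    return 1 if mean_dist_sq(k, x, X_plus) < mean_dist_sq(k, x, X_minus) else -1

k = lambda x, xp: -np.linalg.norm(x - xp)   # cpd kernel (8) with beta = 1, so k(x, x) = 0
X_plus = np.array([[0.0, 0.0], [0.5, 0.2]])
X_minus = np.array([[3.0, 3.0], [2.5, 3.2]])
print(classify(k, np.array([0.2, 0.1]), X_plus, X_minus))   # -> 1
```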

84 In Figure 1, we illustrate the finding that kernel PCA can be carried out using cpd kernels. [sent-178, score-0.771]

85 Due to the centering that is built into kernel PCA (cf. Example 9, (ii), and (5)), the case β = 2 is actually equivalent to linear PCA. [sent-180, score-0.361]

86 As we decrease β, we obtain increasingly nonlinear feature extractors. [sent-181, score-0.175]

87 Note, moreover, that as the kernel parameter β gets smaller, less weight is put on large distances, and we get more localized feature extractors (in the sense that the regions where they have large gradients, i.e. [sent-182, score-0.465]

88 dense sets of contour lines in the plot, get more localized). [sent-184, score-0.055]

89 Figure 1: Kernel PCA on a toy dataset using the cpd kernel (8); contour plots of the feature extractors corresponding to projections onto the first two principal axes in feature space. [sent-185, score-1.091]

90 Notice how smaller values of β make the feature extractors increasingly nonlinear, which allows the identification of the cluster structure. [sent-189, score-0.201]
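
A sketch of an experiment in the spirit of Figure 1 is given below. It is an approximation under stated assumptions, not the original code: the toy data, the function names, and the centering with c = e/m are choices made here. It runs kernel PCA with the cpd kernel (8) for several values of β; β = 2 reproduces linear PCA up to scaling, and smaller β gives more nonlinear feature extractors.

```python
import numpy as np

def kernel_pca_cpd(X, beta=1.0, n_components=2):
    # Gram matrix of the cpd kernel (8): K_ij = -||x_i - x_j||^beta.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = -D ** beta
    # Centering (Eq. (14) with c = e/m) turns the cpd Gram matrix into a pd one.
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    Kc = H @ K @ H
    # Eigendecomposition and projection onto the leading principal axes, as in kernel PCA.
    w, V = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:n_components]
    w, V = w[idx], V[:, idx]
    return V * np.sqrt(np.maximum(w, 0.0))   # projections of the training points

# Toy data: two clusters.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)), rng.normal(2.0, 0.3, size=(30, 2))])
for beta in (2.0, 1.0, 0.5):
    Z = kernel_pca_cpd(X, beta=beta)
    print(beta, Z.shape)
```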

91 3 Conclusion We have described a kernel trick for distances in feature spaces. [sent-190, score-0.702]

92 It can be used to generalize all distance based algorithms to a feature space setting by substituting a suitable kernel function for the squared distance. [sent-191, score-0.692]

93 The class of kernels that can be used is larger than those commonly used in kernel methods (known as positive definite kernels) . [sent-192, score-1.058]

94 We have argued that this reflects the translation invariance of distance based algorithms, as opposed to genuinely dot product based algorithms. [sent-193, score-0.423]

95 SVMs and kernel PCA are translation invariant in feature space, hence they are really both distance rather than dot product based. [sent-194, score-0.876]

96 We thus argued that they can both use conditionally positive definite kernels . [sent-195, score-0.838]

97 In the case of the SVM, this drops out of the optimization problem automatically [11], in the case of kernel PCA, it corresponds to the introduction of a reference point in feature space. [sent-196, score-0.456]

98 The contribution of the present work is that it identifies translation invariance as the underlying reason, thus enabling us to use cpd kernels in a much larger class of kernel algorithms, and that it draws the learning community's attention to the kernel trick for distances. [sent-197, score-1.764]

99 Functions of positive and negative type and their connection with the theory of integral equations. [sent-243, score-0.168]

100 The connection between regularization operators and support vector kernels. [sent-290, score-0.062]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cpd', 0.41), ('kernel', 0.361), ('kernels', 0.289), ('xo', 0.256), ('pd', 0.253), ('proposition', 0.225), ('definite', 0.224), ('conditionally', 0.162), ('trick', 0.161), ('dot', 0.156), ('positive', 0.129), ('ci', 0.115), ('ce', 0.112), ('feature', 0.095), ('translation', 0.092), ('hilbert', 0.091), ('distances', 0.085), ('symmetric', 0.083), ('svms', 0.083), ('xi', 0.081), ('ir', 0.081), ('cicj', 0.075), ('extractors', 0.075), ('similarity', 0.065), ('bilinear', 0.065), ('ilx', 0.064), ('distance', 0.061), ('generalizations', 0.059), ('pea', 0.056), ('dissimilarity', 0.054), ('ka', 0.054), ('satisfying', 0.054), ('space', 0.051), ('origin', 0.05), ('ira', 0.05), ('kce', 0.05), ('men', 0.05), ('product', 0.049), ('nonlinear', 0.049), ('map', 0.048), ('smola', 0.047), ('xj', 0.046), ('pca', 0.046), ('definition', 0.045), ('ec', 0.044), ('schdlkopf', 0.043), ('proof', 0.043), ('connection', 0.039), ('realvalued', 0.039), ('chris', 0.039), ('parzen', 0.039), ('gram', 0.039), ('watkins', 0.039), ('algorithms', 0.038), ('let', 0.037), ('invariant', 0.037), ('complement', 0.036), ('proceeds', 0.036), ('lt', 0.034), ('argued', 0.034), ('cj', 0.034), ('offset', 0.034), ('class', 0.033), ('yi', 0.033), ('spaces', 0.033), ('cf', 0.032), ('simplest', 0.032), ('exists', 0.031), ('generalize', 0.031), ('invariance', 0.031), ('increasingly', 0.031), ('toy', 0.029), ('get', 0.029), ('measure', 0.028), ('acm', 0.028), ('substituting', 0.028), ('patterns', 0.027), ('points', 0.027), ('labelled', 0.027), ('squared', 0.027), ('outputs', 0.027), ('representation', 0.027), ('intuitive', 0.026), ('contour', 0.026), ('community', 0.026), ('us', 0.026), ('em', 0.026), ('really', 0.025), ('products', 0.025), ('works', 0.025), ('turn', 0.024), ('windows', 0.024), ('metric', 0.024), ('implies', 0.024), ('moreover', 0.023), ('calculation', 0.023), ('crucial', 0.023), ('regularization', 0.023), ('commonly', 0.022), ('decision', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

2 0.26378471 110 nips-2000-Regularization with Dot-Product Kernels

Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson

Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1

3 0.25284818 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ‖x - y‖. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1

4 0.22901618 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

5 0.22621349 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

6 0.14579666 4 nips-2000-A Linear Programming Approach to Novelty Detection

7 0.13165052 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.12854181 133 nips-2000-The Kernel Gibbs Sampler

9 0.11875787 54 nips-2000-Feature Selection for SVMs

10 0.11698882 75 nips-2000-Large Scale Bayes Point Machines

11 0.11647652 58 nips-2000-From Margin to Sparsity

12 0.11359821 23 nips-2000-An Adaptive Metric Machine for Pattern Classification

13 0.11225897 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

14 0.10837058 120 nips-2000-Sparse Greedy Gaussian Process Regression

15 0.090523139 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

16 0.089303389 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

17 0.079990007 12 nips-2000-A Support Vector Method for Clustering

18 0.079969704 137 nips-2000-The Unscented Particle Filter

19 0.078777276 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

20 0.07654395 18 nips-2000-Active Support Vector Machine Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.265), (1, 0.23), (2, -0.085), (3, 0.124), (4, -0.098), (5, 0.409), (6, -0.075), (7, 0.003), (8, -0.066), (9, 0.06), (10, 0.011), (11, -0.066), (12, -0.113), (13, -0.004), (14, 0.036), (15, -0.035), (16, 0.244), (17, 0.061), (18, 0.022), (19, -0.056), (20, -0.095), (21, 0.018), (22, -0.007), (23, 0.06), (24, 0.113), (25, 0.043), (26, -0.025), (27, -0.121), (28, 0.013), (29, -0.044), (30, -0.063), (31, -0.081), (32, -0.066), (33, 0.073), (34, -0.004), (35, -0.097), (36, 0.003), (37, 0.084), (38, 0.014), (39, 0.016), (40, 0.033), (41, -0.016), (42, -0.009), (43, 0.065), (44, -0.056), (45, -0.017), (46, 0.048), (47, 0.019), (48, -0.054), (49, -0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98177123 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

2 0.83753246 110 nips-2000-Regularization with Dot-Product Kernels

Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson

Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1

3 0.82047141 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ‖x - y‖. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1

4 0.70243347 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

5 0.67827904 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller

Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both, KFD and the proposed sparse KFD, can be understood in an unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.

6 0.66409564 121 nips-2000-Sparse Kernel Principal Component Analysis

7 0.48885924 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.46682626 4 nips-2000-A Linear Programming Approach to Novelty Detection

9 0.40543514 23 nips-2000-An Adaptive Metric Machine for Pattern Classification

10 0.37404945 120 nips-2000-Sparse Greedy Gaussian Process Regression

11 0.37043509 133 nips-2000-The Kernel Gibbs Sampler

12 0.36904851 54 nips-2000-Feature Selection for SVMs

13 0.29233935 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

14 0.28253168 58 nips-2000-From Margin to Sparsity

15 0.27488393 12 nips-2000-A Support Vector Method for Clustering

16 0.27162457 75 nips-2000-Large Scale Bayes Point Machines

17 0.26020476 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

18 0.25231102 148 nips-2000-`N-Body' Problems in Statistical Learning

19 0.24962327 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

20 0.24239936 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.033), (17, 0.133), (33, 0.058), (55, 0.017), (62, 0.034), (65, 0.014), (67, 0.401), (75, 0.02), (76, 0.103), (79, 0.019), (90, 0.04), (97, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98888737 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

Author: Ilya Nemenman, William Bialek

Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory ('smoothness scale') can be determined self-consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets.

2 0.98841918 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee

Abstract: The principle of maximizing mutual information is applied to learning overcomplete and recurrent representations. The underlying model consists of a network of input units driving a larger number of output units with recurrent interactions. In the limit of zero noise, the network is deterministic and the mutual information can be related to the entropy of the output units. Maximizing this entropy with respect to both the feedforward connections as well as the recurrent interactions results in simple learning rules for both sets of parameters. The conventional independent components (ICA) learning algorithm can be recovered as a special case where there is an equal number of output units and no recurrent connections. The application of these new learning rules is illustrated on a simple two-dimensional input example.

same-paper 3 0.93908966 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

4 0.81157774 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account

Author: Gal Chechik, Naftali Tishby

Abstract: The paradigm of Hebbian learning has recently received a novel interpretation with the discovery of synaptic plasticity that depends on the relative timing of pre and post synaptic spikes. This paper derives a temporally dependent learning rule from the basic principle of mutual information maximization and studies its relation to the experimentally observed plasticity. We find that a supervised spike-dependent learning rule sharing similar structure with the experimentally observed plasticity increases mutual information to a stable near optimal level. Moreover, the analysis reveals how the temporal structure of time-dependent learning rules is determined by the temporal filter applied by neurons over their inputs. These results suggest experimental prediction as to the dependency of the learning rule on neuronal biophysical parameters 1

5 0.77536476 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

Author: Sumio Watanabe

Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold , since Fisher information matrices are singular. In this paper , the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIO, resulting in the smaller generalization error than regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed then the stochastic complexity has the same form as BIO. It is useful for model selection, but not for generalization. 1

6 0.71138924 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

7 0.71111327 146 nips-2000-What Can a Single Neuron Compute?

8 0.70173538 21 nips-2000-Algorithmic Stability and Generalization Performance

9 0.6974321 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

10 0.69323462 44 nips-2000-Efficient Learning of Linear Perceptrons

11 0.69321257 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

12 0.6817652 22 nips-2000-Algorithms for Non-negative Matrix Factorization

13 0.67909634 79 nips-2000-Learning Segmentation by Random Walks

14 0.66803467 111 nips-2000-Regularized Winnow Methods

15 0.66785103 37 nips-2000-Convergence of Large Margin Separable Linear Classification

16 0.66535968 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

17 0.66411507 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

18 0.66392612 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

19 0.65547532 125 nips-2000-Stability and Noise in Biochemical Switches

20 0.65253407 49 nips-2000-Explaining Away in Weight Space