nips nips2000 nips2000-95 knowledge-graph by maker-knowledge-mining

95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling


Source: pdf

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. [sent-11, score-1.152]

2 This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. [sent-14, score-0.495]

3 In multidimensional scaling (MDS) the aim is to place n points in a low dimensional space (usually Euclidean) so that the interpoint distances dij have a particular relationship to the original dissimilarities. [sent-17, score-0.823]

4 In classical scaling we would like the interpoint distances to be equal to the dissimilarities. [sent-18, score-0.686]

5 For example, classical scaling can be used to reconstruct a map of the locations of some cities given the distances between them. [sent-19, score-0.567]

6 In metric MDS the relationship is of the form d_ij ≈ f(δ_ij) where f is a specific function. [sent-20, score-0.313]

7 In this paper we show that the kernel PCA algorithm of Schölkopf et al [7] can be interpreted as performing metric MDS if the kernel function is isotropic. [sent-21, score-0.877]

8 This is achieved by performing classical scaling in the feature space defined by the kernel. [sent-22, score-0.578]

9 The structure of the remainder of this paper is as follows: In section 2 classical and metric MDS are reviewed, and in section 3 the kernel PCA algorithm is described. [sent-23, score-0.766]

10 Section 5 describes approaches to choosing the kernel function, and we finish with a brief discussion in section 6. [sent-25, score-0.364]

11 1 Classical scaling Given n objects and the corresponding dissimilarity matrix, classical scaling is an algebraic method for finding a set of points in space so that the dissimilarities are well-approximated by the interpoint distances. [sent-27, score-1.247]

12 The classical scaling algorithm is introduced below by starting with the locations of n points, constructing a dissimilarity matrix based on their Euclidean distances, and then showing how the configuration of the points can be reconstructed (as far as possible) from the dissimilarity matrix. [sent-28, score-0.927]

13 Let the coordinates of n points in p dimensions be denoted by x_i, i = 1, . . . , n. [sent-29, score-0.204]

14 These can be collected together in an n x p matrix X. [sent-33, score-0.09]

15 The dissimilarities are calculated by δ²_ij = (x_i - x_j)^T(x_i - x_j). [sent-34, score-0.237]

16 Given these dissimilarities, we construct the matrix A such that a_ij = -(1/2) δ²_ij, and then set B = HAH, where H is the centering matrix H = I_n - (1/n) 1 1^T. [sent-35, score-0.263]

17 In matrix form we have B = (HX)(HX)^T, and B is real, symmetric and positive semi-definite. [sent-37, score-0.09]

18 Let the eigendecomposition of B be B = V Λ V^T, where Λ is a diagonal matrix and V is a matrix whose columns are the eigenvectors of B. [sent-38, score-0.309]

19 If p < n, there will be n - p zero eigenvalues (footnote 1). [sent-39, score-0.127]

20 Let Λ_p = diag(λ_1, . . . , λ_p) and V_p be the n x p matrix whose columns correspond to the first p eigenvectors of B, with the usual normalization so that the eigenvectors have unit length. [sent-46, score-0.206]

21 The matrix X̂ of the reconstructed coordinates is given by X̂ = V_p Λ_p^{1/2}. [sent-47, score-0.224]

22 It may not be necessary to use all p dimensions to obtain a reasonable approximation; a configuration X̂ in k dimensions can be obtained by using the largest k eigenvalues. [sent-55, score-0.111]

23 These are known as the principal coordinates of X in k dimensions. [sent-57, score-0.173]

24 The fraction of the variance explained by the first k eigenvalues is Σ_{i=1}^k λ_i / Σ_{i=1}^p λ_i. Classical scaling as explained above works on Euclidean distances as the dissimilarities. [sent-58, score-0.642]
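
To make the classical scaling procedure described above concrete, here is a minimal NumPy sketch (illustrative only; names such as classical_scaling and delta2 are ours, not the paper's): square the dissimilarities, form a_ij = -(1/2) δ²_ij, double-centre to get B = HAH, and read the principal coordinates off the top eigenvectors.

```python
import numpy as np

def classical_scaling(delta2, k):
    """Classical scaling from an n x n matrix of squared dissimilarities delta2."""
    n = delta2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I_n - (1/n) 1 1^T
    B = H @ (-0.5 * delta2) @ H                # B = HAH with a_ij = -(1/2) delta_ij^2
    evals, evecs = np.linalg.eigh(B)           # B = V Lambda V^T
    order = np.argsort(evals)[::-1]            # sort eigenvalues, largest first
    evals, evecs = evals[order], evecs[:, order]
    coords = evecs[:, :k] * np.sqrt(np.clip(evals[:k], 0.0, None))  # X_hat = V_k Lambda_k^(1/2)
    return coords, evals

# Example: recover a 3-D configuration (up to rotation/translation/reflection)
# from its Euclidean distances, then look at the explained-variance fraction.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
delta2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
coords, evals = classical_scaling(delta2, k=2)
pos = np.clip(evals, 0.0, None)
print("fraction explained by first 2 eigenvalues:", pos[:2].sum() / pos.sum())
```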

25 However, one can run the same algorithm with a non-Euclidean dissimilarity matrix, although in this case there is no guarantee that the eigenvalues will be non-negative. [sent-59, score-0.23]

26 Classical scaling derives from the work of Schoenberg and Young and Householder in the 1930's. [sent-60, score-0.235]

27 1 Optimality properties of classical scaling Mardia et al [5] (section 14. [sent-64, score-0.496]

28 4) give the following optimality property of the classical scaling solution. [sent-65, score-0.48]

29 1 In fact if the points are not in "general position" the number of zero eigenvalues will be greater than n - p. [sent-66, score-0.196]

30 Theorem 1 Let X denote a configuration of points in ℝ^p. [sent-68, score-0.149]

31 The interpoint distances are δ²_ij = (x_i - x_j)^T(x_i - x_j). [sent-69, score-0.234]

32 Let L be a p x p rotation matrix and set L = (L_1, L_2), where L_1 is p x k for k < p. [sent-70, score-0.126]

33 Let X̂ = X L_1, the projection of X onto a k-dimensional subspace of ℝ^p. [sent-71, score-0.125]

34 Amongst all projections X̂ = X L_1, the quantity φ = Σ_{i,j} (δ²_ij - d̂²_ij) (with d̂_ij the interpoint distances of X̂) is minimized when X is projected onto its principal coordinates in k dimensions. [sent-73, score-0.224]

35 The value of φ for the principal coordinate projection is φ = 2n(λ_{k+1} + · · · + λ_p). [sent-75, score-0.143]

36 2 Relationships between classical scaling and PCA There is a well-known relationship between PCA and classical scaling; see e.g. [sent-80, score-0.747]

37 Principal components analysis (PCA) is concerned with the eigendecomposition of the sample covariance matrix S = (1/n) X^T H X. [sent-85, score-0.211]

38 It is easy to show that the eigenvalues of nS are the p non-zero eigenvalues of B. [sent-86, score-0.254]

39 (2) Thus we see that the projection of X onto the eigenvectors of nS returns the classical scaling solution. [sent-92, score-0.659]
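
A short numerical check of this relationship (a sketch on our own synthetic data, not the paper's experiment): the non-zero eigenvalues of B = (HX)(HX)^T coincide with those of nS = X^T H X, and projecting the centred data onto the eigenvectors of nS reproduces the classical scaling coordinates up to a sign per axis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))
H = np.eye(n) - np.ones((n, n)) / n

B = (H @ X) @ (H @ X).T                 # n x n matrix used by classical scaling
nS = X.T @ H @ X                        # n times the sample covariance matrix S

# Same non-zero spectrum.
eig_B = np.sort(np.linalg.eigvalsh(B))[::-1][:p]
eig_nS = np.sort(np.linalg.eigvalsh(nS))[::-1]
print(np.allclose(eig_B, eig_nS))       # True

# Projection of the centred data onto the eigenvectors of nS equals the
# classical scaling coordinates V_p Lambda_p^(1/2), up to a sign per column.
vals_s, U = np.linalg.eigh(nS)
proj = (H @ X) @ U[:, ::-1]             # columns ordered by decreasing eigenvalue
vals_b, V = np.linalg.eigh(B)
coords = V[:, ::-1][:, :p] * np.sqrt(np.clip(vals_b[::-1][:p], 0.0, None))
print(np.allclose(np.abs(proj), np.abs(coords)))   # True
```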

40 3 Metric MDS The aim of classical scaling is to find a configuration of points X so that the interpoint distances d_ij well approximate the dissimilarities δ_ij. [sent-94, score-1.153]

41 In metric MDS this criterion is relaxed, so that instead we require d_ij ≈ f(δ_ij) (equation 3), where f is a specified (analytic) function. [sent-95, score-0.185]

42 A straightforward way to carry out metric MDS is to define an error function (or stress) S = Σ_{i<j} w_ij (d_ij - f(δ_ij))² (equation 4), where the {w_ij} are appropriately chosen weights. [sent-99, score-0.185]

43 One can then obtain derivatives of S with respect to the coordinates of the points that define the d_ij's and use gradient-based (or more sophisticated) methods to minimize the stress. [sent-100, score-0.197]
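
For contrast with the eigenproblem route developed later, here is a minimal sketch of this iterative approach, assuming unit weights w_ij = 1 and f equal to the identity (our simplifications; the function and variable names are illustrative, not the paper's).

```python
import numpy as np

def stress_mds(delta, k=2, lr=0.01, n_iter=2000, seed=0):
    """Metric MDS by gradient descent on S = sum_{i<j} (d_ij - delta_ij)^2."""
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1e-2, size=(n, k))            # random initial configuration
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]           # x_i - x_j for all pairs
        d = np.sqrt((diff ** 2).sum(axis=-1)) + 1e-12  # current distances d_ij
        # dS/dx_i = sum_j 2 (d_ij - delta_ij) (x_i - x_j) / d_ij
        grad = 2.0 * (((d - delta) / d)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X

# Usage: delta is an n x n matrix of target dissimilarities delta_ij.
# X_hat = stress_mds(delta, k=2)
```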

44 Critchley (1978) (section 2 of Cox and Cox) carried out metric MDS by running the classical scaling algorithm on the transformed dissimilarities. [sent-107, score-0.637]

45 Critchley suggests a power transformation of the dissimilarities; when the dissimilarities are derived from Euclidean distances, we note that the kernel k(x, y) = -||x - y||^β is conditionally positive definite (CPD) if β ≤ 2 [1]. [sent-110, score-0.614]

46 When the kernel is CPD, the centered matrix will be positive semi-definite. [sent-111, score-0.384]

47 This constraint makes sense for some kinds of dissimilarity data (e.g. [sent-116, score-0.103]

48 3 Kernel PCA In recent years there has been an explosion of work on kernel methods. [sent-119, score-0.294]

49 A point x in the original space is re-represented as a point φ(x) in an N_F-dimensional feature space F, where φ(x) = (φ_1(x), φ_2(x), . . . , φ_{N_F}(x)). [sent-124, score-0.121]

50 The key to the kernel trick is to realize that for many algorithms, the only quantities required are of the form φ(x_i) · φ(x_j). [sent-129, score-0.349]

51 Schölkopf, Smola and Müller [7] used this trick to define kernel PCA. [sent-132, score-0.349]

52 One could compute the covariance matrix in the feature space and then calculate its eigenvectors/eigenvalues. [sent-133, score-0.234]

53 However, using the relationship between B and the sample covariance matrix S described above, we can instead consider the n x n matrix K with entries K_ij = k(x_i, x_j) for i, j = 1, . . . , n. [sent-134, score-0.284]

54 If N_F > n, using K will be more efficient than working with the covariance matrix in feature space, and anyway the latter would be singular. [sent-137, score-0.234]

55 The data should be centered in the feature space so that Σ_{i=1}^n φ(x_i) = 0. [sent-138, score-0.094]

56 This is achieved by carrying out the eigendecomposition of K̃ = HKH, which gives the coordinates of the approximating points as described in section 2. [sent-139, score-0.279]

57 Thus we see that the visualization of data by projecting it onto the first k eigenvectors is exactly classical scaling in feature space. [sent-141, score-0.675]
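
Putting the last two points together, here is a minimal sketch of kernel PCA viewed as classical scaling in feature space (our illustrative code, with an RBF kernel as a running example): build K, centre it to K̃ = HKH, and take the top eigenvectors scaled by the square roots of their eigenvalues as visualisation coordinates.

```python
import numpy as np

def rbf_kernel(X, theta):
    """Isotropic RBF kernel: k(x_i, x_j) = exp(-theta * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def kernel_mds(K, k):
    """Classical scaling in feature space: eigendecompose K_tilde = HKH."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    evals, evecs = np.linalg.eigh(H @ K @ H)
    evals, evecs = evals[::-1], evecs[:, ::-1]                 # largest first
    coords = evecs[:, :k] * np.sqrt(np.clip(evals[:k], 0.0, None))
    return coords, evals

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
coords, evals = kernel_mds(rbf_kernel(X, theta=0.5), k=2)      # 2-D coordinates for plotting
```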

58 4 A relationship between kernel PCA and metric MDS We consider two cases. [sent-142, score-0.533]

59 In section 4.1 we deal with the case that the kernel is isotropic and obtain a close relationship between kernel PCA and metric MDS. [sent-144, score-0.975]

60 If the kernel is non-stationary, a rather less close relationship is derived in section 4.2. [sent-145, score-0.406]

61 1 Isotropic kernels A kernel function is stationary if k(x_i, x_j) depends only on the vector τ = x_i - x_j. [sent-151, score-0.092]

62 A stationary covariance function is isotropic if k(x_i, x_j) depends only on the distance δ_ij, where δ²_ij = τ^T τ; in this case we write k(x_i, x_j) = r(δ_ij). [sent-152, score-0.26]

63 Assume that the kernel is scaled so that r(0) = 1. [sent-154, score-0.294]

64 An example of an isotropic kernel is the squared exponential or RBF (radial basis function) kernel k(x_i, x_j) = exp{-θ(x_i - x_j)^T(x_i - x_j)}, for some parameter θ > 0. [sent-155, score-0.736]

65 Consider the Euclidean distance in feature space, δ̃²_ij = (φ(x_i) - φ(x_j))^T(φ(x_i) - φ(x_j)). [sent-156, score-0.126]

66 With an isotropic kernel this can be re-expressed as δ̃²_ij = 2(1 - r(δ_ij)). [sent-157, score-0.442]

67 Thus the matrix A has elements a_ij = r(δ_ij) - 1, which can be written as A = K - 1 1^T. [sent-158, score-0.118]

68 It can be easily verified that the centering matrix H annihilates 1 1^T, so that HAH = HKH. [sent-159, score-0.145]

69 We see that the configuration of points derived from performing classical scaling on K actually aims to approximate the feature-space distances computed as δ̃_ij = √(2(1 - r(δ_ij))). [sent-160, score-0.826]

70 As the δ̃_ij's are a non-linear function of the δ_ij's, this procedure (kernel MDS) is an example of metric MDS. [sent-161, score-0.185]
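
A quick numerical confirmation of these identities for an isotropic kernel with r(0) = 1 (a self-contained sketch; the data and the value of θ are arbitrary choices of ours): H annihilates 1 1^T, so HAH = HKH, and the feature-space squared distances 2(1 - r(δ_ij)) agree with K_ii + K_jj - 2K_ij.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 4))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # input-space squared distances
K = np.exp(-0.3 * d2)                                       # isotropic RBF kernel, r(0) = 1

n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
A = K - np.ones((n, n))                                     # a_ij = r(delta_ij) - 1
print(np.allclose(H @ A @ H, H @ K @ H))                    # True: H annihilates 1 1^T

# Feature-space squared distances ||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij
# coincide with 2 (1 - r(delta_ij)) because the kernel is isotropic with r(0) = 1.
feat_d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2.0 * K
print(np.allclose(feat_d2, 2.0 * (1.0 - K)))                # True
```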

71 Remark 1 Kernel functions are usually chosen to be conditionally positive definite, so that the eigenvalues of the matrix K̃ = HKH will be non-negative. [sent-162, score-0.243]

72 Choosing arbitrary functions to transform the dissimilarities will not give this guarantee. [sent-163, score-0.237]

73 Remark 2 In nonmetric MDS we require that d_ij ≈ f(δ_ij) for some monotonic function f. [sent-164, score-0.182]

74 If the kernel function r is monotonically decreasing then clearly 1 - r is monotonically increasing. [sent-165, score-0.352]

75 However, there are valid isotropic kernel (covariance) functions which are non-monotonic (e.g. [sent-166, score-0.442]

76 the exponentially damped cosine r(δ) = e^{-αδ} cos(ωδ); see [11] for details) and thus we see that f need not be monotonic in kernel MDS. [sent-168, score-0.372]

77 Remark 3 One advantage of PCA is that it defines a mapping from the original space to the principal coordinates, and hence that if a new point x arrives, its projection onto the principal coordinates defined by the original n data points can be computed. [sent-169, score-0.522]

78 The same property holds in kernel PCA, so that the projection of φ(x) onto the rth principal direction in feature space can be computed using the kernel trick as Σ_{i=1}^n α_i^r k(x, x_i), where α^r is the rth eigenvector of K̃ (see equation 4. [sent-170, score-1.075]

79 This projection property does not hold for algorithms that simply minimize the stress objective function; for example the Sammon "mapping" algorithm [6] does not in fact define a mapping. [sent-172, score-0.215]
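
A sketch of the out-of-sample projection described in Remark 3 (our code, not the paper's): the eigenvector α^r of K̃ is given the usual 1/√λ_r normalisation of Schölkopf et al, and the test kernel values are centred consistently with the training centring, so that projecting a training point this way reproduces its classical scaling coordinate.

```python
import numpy as np

def rbf(A, B, theta=0.5):
    """k(a, b) = exp(-theta * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

rng = np.random.default_rng(4)
X_train = rng.normal(size=(20, 3))
K = rbf(X_train, X_train)
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
evals, V = np.linalg.eigh(H @ K @ H)
evals, V = evals[::-1], V[:, ::-1]                     # largest eigenvalues first

x_new = rng.normal(size=(1, 3))
k_new = rbf(x_new, X_train)                            # k(x, x_i) for the new point
# Centre the test kernel values consistently with the training centring.
k_cent = k_new - k_new.mean(axis=1, keepdims=True) - K.mean(axis=0) + K.mean()

r = 2                                                  # number of principal directions
alphas = V[:, :r] / np.sqrt(evals[:r])                 # alpha^r = v^r / sqrt(lambda_r)
proj = k_cent @ alphas                                 # projections of phi(x_new)
print(proj)

# Sanity check: projecting the training points this way reproduces their
# classical scaling coordinates V_r Lambda_r^(1/2).
print(np.allclose((H @ K @ H) @ alphas, V[:, :r] * np.sqrt(evals[:r])))
```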

80 We can again show that the kernel MDS procedure operates on the matrix H K H. [sent-179, score-0.384]

81 However, the distance δ̃_ij in feature space is not a function of δ_ij, and so the relationship of equation 3 does not hold. [sent-180, score-0.211]

82 Similarities can be converted to dissimilarities through δ̃²_ij = c_ii + c_jj - 2c_ij, where c_ij denotes the similarity between items i and j in feature space. [sent-208, score-0.343]

83 Then we see that the similarity in feature space is given by c_ij = φ(x_i) · φ(x_j) = k(x_i, x_j). [sent-209, score-0.162]

84 If the kernel is a function of x_i · x_j (the similarity in input space), we see then that the similarity in feature space is a non-linear function of the similarity measured in input space. [sent-212, score-0.25]

85 5 Choice of kernel Having performed kernel MDS one can plot the scatter diagram (or Shepard diagram) of the dissimilarities against the fitted distances. [sent-213, score-0.879]

86 We know that for each pair the fitted distance d_ij ≤ δ̃_ij because of the projection property in feature space. [sent-214, score-0.255]

87 The sum of the residuals is given by 2n Σ_{i=k+1}^n λ_i, where the {λ_i} are the eigenvalues of K̃ = HKH. [sent-215, score-0.127]

88 (See Theorem 1 above and recall that at most n of the eigenvalues of the covariance matrix in feature space will be non-zero. [sent-216, score-0.361]

89 ) Hence the fraction of the sum-squared distance explained by the first k dimensions is γ = Σ_{i=1}^k λ_i / Σ_{i=1}^n λ_i. [sent-217, score-0.151]

90 One idea for choosing the kernel would be to fix the dimensionality k and choose r(·) so that γ is maximized. [sent-218, score-0.329]

91 Consider the effect of varying θ in the RBF kernel k(x_i, x_j) = exp{-θ(x_i - x_j)^T(x_i - x_j)}. [sent-219, score-0.294]

92 As θ → ∞ we have δ̃²_ij = 2(1 - δ(i,j)) (where δ(i,j) is the Kronecker delta), which are the distances corresponding to a regular simplex. [sent-220, score-0.115]

93 Letting θ → 0 and using e^{-θz} ≈ 1 - θz for small θ, we can show that K_ij = 1 - θδ²_ij as θ → 0, and thus that the classical scaling solution is obtained in this limit. [sent-222, score-0.477]

94 By choosing an index k one can observe from Figure 1 what fraction of the variance is explained by the first k eigenvalues. [sent-228, score-0.147]
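
This selection heuristic is easy to try numerically; the sketch below (illustrative data and θ grid of our own choosing, not the paper's experiment) computes γ = Σ_{i=1}^k λ_i / Σ_{i=1}^n λ_i for the eigenvalues of K̃ = HKH over a range of θ values in the RBF kernel.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
n = X.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
k = 2                                                  # target dimensionality

for theta in [0.01, 0.1, 1.0, 10.0]:
    K = np.exp(-theta * d2)                            # RBF kernel with parameter theta
    evals = np.sort(np.linalg.eigvalsh(H @ K @ H))[::-1]
    evals = np.clip(evals, 0.0, None)
    gamma = evals[:k].sum() / evals.sum()              # fraction explained by first k dims
    print(f"theta = {theta:5.2f}   gamma = {gamma:.3f}")
```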

95 6 Discussion The results above show that kernel PCA using an isotropic kernel function can be interpreted as performing a kind of metric MDS. [sent-231, score-1.008]

96 The main difference between the kernel MDS algorithm and other metric MDS algorithms is that kernel MDS uses the classical scaling solution in feature space. [sent-232, score-1.312]

97 The advantage of the classical scaling solution is that it is computed from an eigenproblem, and avoids the iterative optimization of the stress objective function that is used for most other MDS solutions. [sent-233, score-0.566]

98 The classical scaling solution is unique up to the unavoidable translation, rotation and reflection symmetries (assuming that there are no repeated eigenvalues). [sent-234, score-0.513]

99 Critchley's work (1978) is somewhat similar to kernel MDS, but it lacks the notion of a projection into feature space and does not always ensure that the matrix B is non-negative definite. [sent-235, score-0.552]

100 We have also looked at the question of adapting the kernel so as to minimize the sum of the residuals. [sent-236, score-0.318]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mds', 0.545), ('kernel', 0.294), ('dissimilarities', 0.237), ('scaling', 0.235), ('classical', 0.217), ('metric', 0.185), ('hx', 0.184), ('isotropic', 0.148), ('xi', 0.136), ('xj', 0.131), ('eigenvalues', 0.127), ('critchley', 0.119), ('interpoint', 0.119), ('distances', 0.115), ('pca', 0.109), ('coordinates', 0.104), ('dissimilarity', 0.103), ('cox', 0.102), ('matrix', 0.09), ('configuration', 0.08), ('projection', 0.074), ('dij', 0.074), ('multidimensional', 0.072), ('eigendecomposition', 0.071), ('mardia', 0.071), ('principal', 0.069), ('points', 0.069), ('pea', 0.066), ('stress', 0.064), ('feature', 0.062), ('kernels', 0.062), ('oij', 0.061), ('eigenvectors', 0.058), ('centering', 0.055), ('dii', 0.055), ('remark', 0.055), ('trick', 0.055), ('relationship', 0.054), ('explained', 0.053), ('onto', 0.051), ('covariance', 0.05), ('euclidean', 0.049), ('bela', 0.047), ('dri', 0.047), ('edinburgh', 0.047), ('eigenproblem', 0.047), ('hxf', 0.047), ('jlj', 0.047), ('kruskal', 0.047), ('nonmetric', 0.047), ('sammon', 0.047), ('wii', 0.047), ('ns', 0.046), ('al', 0.044), ('similarity', 0.044), ('axes', 0.043), ('eigenvector', 0.042), ('nf', 0.041), ('cpd', 0.037), ('rth', 0.037), ('cij', 0.037), ('rotation', 0.036), ('choosing', 0.035), ('section', 0.035), ('fraction', 0.035), ('definite', 0.034), ('ai', 0.033), ('performing', 0.032), ('distance', 0.032), ('kij', 0.032), ('space', 0.032), ('ij', 0.031), ('dimensions', 0.031), ('reconstructed', 0.03), ('verlag', 0.03), ('monotonic', 0.03), ('spline', 0.03), ('stationary', 0.03), ('vi', 0.029), ('monotonically', 0.029), ('property', 0.028), ('interpreted', 0.028), ('projecting', 0.028), ('fitted', 0.028), ('aij', 0.028), ('original', 0.027), ('kind', 0.027), ('ad', 0.027), ('diagram', 0.026), ('rank', 0.026), ('aim', 0.026), ('ap', 0.026), ('conditionally', 0.026), ('solution', 0.025), ('objective', 0.025), ('variance', 0.024), ('see', 0.024), ('minimize', 0.024), ('derived', 0.023), ('yi', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed.

2 0.26042011 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

3 0.25284818 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

4 0.1530339 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

5 0.11332321 110 nips-2000-Regularization with Dot-Product Kernels

Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson

Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1

6 0.10523777 4 nips-2000-A Linear Programming Approach to Novelty Detection

7 0.096899994 27 nips-2000-Automatic Choice of Dimensionality for PCA

8 0.094695941 75 nips-2000-Large Scale Bayes Point Machines

9 0.093710072 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

10 0.092565119 133 nips-2000-The Kernel Gibbs Sampler

11 0.090993837 120 nips-2000-Sparse Greedy Gaussian Process Regression

12 0.085446954 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

13 0.085027799 58 nips-2000-From Margin to Sparsity

14 0.083325006 74 nips-2000-Kernel Expansions with Unlabeled Examples

15 0.074802965 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

16 0.068452984 54 nips-2000-Feature Selection for SVMs

17 0.068000168 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

18 0.063313723 12 nips-2000-A Support Vector Method for Clustering

19 0.058096342 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

20 0.057613458 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.228), (1, 0.145), (2, -0.046), (3, 0.132), (4, -0.049), (5, 0.384), (6, -0.083), (7, 0.007), (8, -0.071), (9, 0.02), (10, -0.058), (11, -0.066), (12, -0.102), (13, 0.031), (14, -0.041), (15, 0.018), (16, 0.157), (17, 0.064), (18, 0.008), (19, 0.047), (20, -0.047), (21, 0.116), (22, -0.106), (23, -0.017), (24, 0.026), (25, 0.187), (26, -0.046), (27, -0.05), (28, -0.049), (29, -0.029), (30, -0.01), (31, -0.06), (32, -0.028), (33, 0.04), (34, 0.016), (35, -0.021), (36, -0.013), (37, -0.004), (38, -0.031), (39, 0.017), (40, -0.069), (41, 0.021), (42, 0.018), (43, -0.052), (44, 0.069), (45, -0.017), (46, -0.065), (47, 0.003), (48, -0.065), (49, -0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9737342 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed.

2 0.83762419 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

3 0.81905043 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

4 0.58535957 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

5 0.56591797 110 nips-2000-Regularization with Dot-Product Kernels

Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson

Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1

6 0.46340832 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

7 0.45231467 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

8 0.41902381 27 nips-2000-Automatic Choice of Dimensionality for PCA

9 0.38987482 74 nips-2000-Kernel Expansions with Unlabeled Examples

10 0.38677436 4 nips-2000-A Linear Programming Approach to Novelty Detection

11 0.37119919 120 nips-2000-Sparse Greedy Gaussian Process Regression

12 0.36720082 133 nips-2000-The Kernel Gibbs Sampler

13 0.34537289 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

14 0.32396957 12 nips-2000-A Support Vector Method for Clustering

15 0.32125384 23 nips-2000-An Adaptive Metric Machine for Pattern Classification

16 0.31678975 79 nips-2000-Learning Segmentation by Random Walks

17 0.30092403 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

18 0.29963914 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

19 0.2916148 148 nips-2000-`N-Body' Problems in Statistical Learning

20 0.28840029 54 nips-2000-Feature Selection for SVMs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.183), (10, 0.022), (17, 0.165), (32, 0.031), (33, 0.047), (55, 0.016), (62, 0.024), (65, 0.01), (67, 0.073), (75, 0.031), (76, 0.183), (79, 0.02), (81, 0.013), (90, 0.039), (91, 0.014), (97, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89483488 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

Author: Christopher K. I. Williams

Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed.

2 0.88257092 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

3 0.85389107 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

Author: Rafal Bogacz, Malcolm W. Brown, Christophe G. Giraud-Carrier

Abstract: Olshausen & Field demonstrated that a learning algorithm that attempts to generate a sparse code for natural scenes develops a complete family of localised, oriented, bandpass receptive fields, similar to those of 'simple cells' in VI. This paper describes an algorithm which finds a sparse code for sequences of images that preserves information about the input. This algorithm when trained on natural video sequences develops bases representing the movement in particular directions with particular speeds, similar to the receptive fields of the movement-sensitive cells observed in cortical visual areas. Furthermore, in contrast to previous approaches to learning direction selectivity, the timing of neuronal activity encodes the phase of the movement, so the precise timing of spikes is crucially important to the information encoding.

4 0.77041036 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

Author: Dörthe Malzahn, Manfred Opper

Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models. 1

5 0.74530363 110 nips-2000-Regularization with Dot-Product Kernels

Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson

Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1

6 0.72853804 122 nips-2000-Sparse Representation for Gaussian Process Models

7 0.71561927 37 nips-2000-Convergence of Large Margin Separable Linear Classification

8 0.71348017 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

9 0.70168036 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

10 0.68976951 13 nips-2000-A Tighter Bound for Graphical Models

11 0.68876684 74 nips-2000-Kernel Expansions with Unlabeled Examples

12 0.67743629 4 nips-2000-A Linear Programming Approach to Novelty Detection

13 0.67522454 146 nips-2000-What Can a Single Neuron Compute?

14 0.67341846 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

15 0.67324567 134 nips-2000-The Kernel Trick for Distances

16 0.67046261 92 nips-2000-Occam's Razor

17 0.66739267 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

18 0.66701448 79 nips-2000-Learning Segmentation by Random Walks

19 0.66693759 60 nips-2000-Gaussianization

20 0.66208714 21 nips-2000-Algorithmic Stability and Generalization Performance