nips nips2001 nips2001-170 knowledge-graph by maker-knowledge-mining

170 nips-2001-Spectral Kernel Methods for Clustering


Source: pdf

Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola

Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. [sent-5, score-0.382]

2 All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. [sent-6, score-0.308]

3 We use two different but related cost functions, the Alignment and the 'cut cost'. [sent-7, score-0.16]

4 The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. [sent-8, score-0.167]

5 Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. [sent-9, score-0.226]

6 We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. [sent-10, score-0.282]

7 We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. [sent-11, score-0.31]

8 A general algorithm can be selected for the appropriate task before being mapped onto a particular application through the choice of a problem specific kernel function. [sent-14, score-0.302]

9 The kernel based method works by mapping data to a high dimensional feature space implicitly defined by the choice of the kernel function. [sent-15, score-0.721]

10 The kernel function computes the inner product of the images of two inputs in the feature space. [sent-16, score-0.376]

11 From a practitioner's viewpoint this function can also be regarded as a similarity measure and hence provides a natural way of incorporating domain knowledge about the problem into the bias of the system. [sent-17, score-0.143]

12 One important learning problem is that of dividing the data into classes according to a cost function defined on their relative positions in the feature space. [sent-18, score-0.27]

13 We can think of this as clustering in the kernel defined feature space, or non-linear clustering in the input space. [sent-19, score-0.61]

14 They both assume that a kernel has been chosen and the kernel matrix constructed. [sent-21, score-0.711]

15 The methods then make use of the matrix's eigenvectors, or of the eigenvectors of the closely related Laplacian matrix, in order to infer a label assignment that approximately optimizes one of two cost functions . [sent-22, score-0.451]

16 See also [4] for use of spectral decompositions of the kernel matrix. [sent-23, score-0.364]

17 2 Two partition cost measures All the information needed to specify a clustering of a set of data is contained in the matrix M_ij = (cluster(x_i) == cluster(x_j)), where (A == B) ∈ {-1, +1}. [sent-25, score-0.451]
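To make the definition of M concrete, here is a minimal NumPy sketch (an illustration, not part of the original paper) that builds M from an arbitrary vector of cluster assignments:

```python
import numpy as np

def cluster_indicator_matrix(cluster):
    """M_ij = +1 if points i and j are assigned the same cluster, -1 otherwise."""
    c = np.asarray(cluster)
    return np.where(c[:, None] == c[None, :], 1, -1)

# Example: three points, two clusters
print(cluster_indicator_matrix([0, 0, 1]))
```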

18 After a clustering is specified, one can measure its cost in many ways. [sent-26, score-0.318]

19 We propose here two cost functions that are easy to compute and lead to efficient algorithms. [sent-27, score-0.16]

20 Typically one would expect points with similar labels to be clustered and the clusters to be separated. [sent-29, score-0.175]

21 This can be detected in two ways: either by measuring the amount of label-clustering or by measuring the correlation between such variables. [sent-30, score-0.147]

22 In the first case, we need to measure how close points of the same class are to each other and how distant they are from points of different classes. [sent-31, score-0.189]

23 In the second case, kernels can be regarded as oracles predicting whether two points are in the same class. [sent-32, score-0.123]

24 The 'true' oracle is the one that knows the true matrix M. [sent-33, score-0.107]

25 A measure of quality can be obtained by measuring the Pearson correlation coefficient between the kernel matrix K and the true M. [sent-34, score-0.575]

26 Both approaches lead to the same quantity, known as the alignment [3]. [sent-35, score-0.551]

27 Definition 1 (Alignment) The (empirical) alignment of a kernel k1 with a kernel k2 with respect to the sample S is the quantity [sent-38, score-1.25]

28 Â(S, k1, k2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F), where K_i is the kernel matrix for the sample S using kernel k_i. [sent-41, score-0.711]

29 If we consider k2 = yy', where y is the vector of {-1, +1} labels for the sample, then, with a slight abuse of notation, Â(S, k, y) = <K, yy'>_F / sqrt(<K, K>_F <yy', yy'>_F) = <K, yy'>_F / (m ||K||_F), since <yy', yy'>_F = m^2. [sent-43, score-0.141]

30 Another measure of separation between classes is the average separation between two points in different classes, again normalised by the matrix norm. [sent-44, score-0.239]

31 The cut cost of a clustering is defined as the sum Σ_{ij: y_i ≠ y_j} k(x_i, x_j) of the kernel entries between differently labelled points, normalised by the matrix norm. [sent-46, score-0.447]

32 This quantity is motivated by a graph theoretic concept. [sent-51, score-0.137]

33 If we consider the Kernel matrix as the adjacency matrix of a fully connected weighted graph whose nodes are the data points, the cost of partitioning a graph is given by the total weight of the edges that one needs to cut or remove, and is exactly the numerator of the 'cut cost'. [sent-52, score-0.683]
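A corresponding sketch for the cut cost (again an illustration; the exact normalising constant is an assumption based on the description above, which only fixes the numerator and states that it is normalised by the matrix norm):

```python
import numpy as np

def cut_cost(K, y):
    """Sum of kernel entries over (ordered) pairs with different labels,
    normalised by ||K||_F; the normalisation is an assumption, see the lead-in."""
    y = np.asarray(y, dtype=float)
    different = np.outer(y, y) < 0          # True where y_i != y_j
    return float(K[different].sum()) / np.linalg.norm(K, 'fro')
```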

34 Notice also the relation between alignment and cut cost. [sent-53, score-0.551]

35 Among the appealing properties of the alignment is that this quantity is sharply concentrated around its mean, as proven in the companion paper [3]. [sent-62, score-0.155]

36 This shows that the expected alignment can be reliably estimated from its empirical estimate Â. [sent-63, score-0.551]

37 As the cut cost can be expressed as the difference of two alignments, it is similarly concentrated around its mean. [sent-65, score-0.293]

38 3 Optimising the cost with spectral techniques In this section we will introduce and test two related methods for clustering, as well as their extensions to transduction. [sent-68, score-0.222]

39 The general problem we want to solve is to assign class-labels to datapoints so as to maximize one of the two cost functions given above. [sent-69, score-0.16]

40 The difference between the approaches is in the two approximation algorithms developed for the different cost functions. [sent-71, score-0.193]

41 The approximation algorithms are obtained by relaxing the discrete problems of optimising over all possible labellings of a dataset to closely related continuous problems solved by eigenvalue decompositions. [sent-72, score-0.44]

42 See [5] for use of eigenvectors in partitioning sparse matrices. [sent-73, score-0.196]

43 After solving the relaxed problem, we can obtain an approximate discrete solution by choosing a suitable threshold for the entries of the vector y and applying the sign function. [sent-77, score-0.391]

44 One can now transform the vector v into a vector in {-1, +1}^m by choosing the threshold θ that gives maximum alignment of y = sign(v^max − θ). [sent-84, score-0.799]
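Putting the relaxation and the thresholding step together, a minimal sketch of the alignment-based clustering procedure might look as follows (an illustration under the assumptions above, not the authors' implementation); the candidate thresholds are taken to be the entries of the eigenvector itself.

```python
import numpy as np

def cluster_by_alignment(K):
    """Take the eigenvector of K with the largest eigenvalue and choose the
    threshold that maximises the alignment of y = sign(v_max - theta)."""
    eigvals, eigvecs = np.linalg.eigh(K)     # eigenvalues in ascending order
    v_max = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
    m = len(v_max)
    best_y, best_a = None, -np.inf
    for theta in np.sort(v_max):
        y = np.where(v_max - theta >= 0, 1.0, -1.0)
        a = float(y @ K @ y) / (m * np.linalg.norm(K, 'fro'))   # alignment with yy'
        if a > best_a:
            best_a, best_y = a, y
    return best_y, best_a
```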

45 One can hence estimate the quality of a dichotomy by comparing its value with the upper bound. [sent-89, score-0.126]

46 The absolute alignment tells us how specialized a kernel is on a given dataset: the higher this quantity, the more committed the kernel is to a specific dichotomy. [sent-90, score-0.853]

47 The first eigenvector can be calculated in many ways, for example by the Lanczos procedure, which is effective even for large datasets. [sent-91, score-0.295]

48 Search engines like Google are based on estimating the first eigenvector of a matrix with dimensionality greater than 10^9, so approximation techniques exist even for very large datasets. [sent-92, score-0.437]

49 We preprocessed the data by normalising the input vectors in the kernel defined feature space and then centering them by shifting the origin (of the feature space) to their centre of gravity. [sent-94, score-0.488]

50 This can be achieved by the following transformation of the kernel matrix, K ← K − m^{-1} jg' − m^{-1} gj' + m^{-2} (j'Kj) J, where j is the all-ones vector, J the all-ones matrix, and g the vector of row sums of K. [sent-95, score-0.53]
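A sketch of this preprocessing, working only with the kernel matrix (an illustration; the unit-norm normalisation step uses the standard K_ij / sqrt(K_ii K_jj) form, which is an assumption here):

```python
import numpy as np

def normalise_and_centre(K):
    """Normalise feature vectors to unit length in feature space, then shift the
    origin to their centre of gravity, using only kernel evaluations."""
    d = np.sqrt(np.diag(K))
    K = K / np.outer(d, d)                   # K_ij <- K_ij / sqrt(K_ii K_jj)
    m = K.shape[0]
    g = K.sum(axis=1)                        # row sums of K
    jKj = K.sum()                            # j'Kj, the grand sum
    # K <- K - m^{-1} j g' - m^{-1} g j' + m^{-2} (j'Kj) J
    return K - np.outer(np.ones(m), g) / m - np.outer(g, np.ones(m)) / m + jKj / m**2
```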

51 Figure 1: (a) Plot of alignment of the different eigenvectors with the labels, ordered by increasing eigenvalue. [sent-96, score-0.761]

52 Â(S, k, y) for y = sign(v^max − θ_i) (bottom curve), and the accuracy of y (middle curve), against threshold number i. [sent-101, score-0.233]

53 The first experiment applied the unsupervised technique to the Breast Cancer data with a linear kernel. [sent-102, score-0.119]

54 Figure 1(a) shows the alignment of the different eigenvectors with the labels. [sent-103, score-0.135]

55 The highest alignment is shown by the last eigenvector corresponding to the largest eigenvalue. [sent-104, score-0.811]

56 For each value θ_i of the threshold, Figure 1(b) shows the upper bound of [sent-105, score-0.189]

57 Â(S, k, y) for y = sign(v^max − θ_i) (bottom curve), and the accuracy of y (middle curve). [sent-109, score-0.173]

58 Notice that where actual alignment and upper bound on alignment get closest, we have confidence that we have partitioned our data well, and in fact the accuracy is also maximized. [sent-110, score-1.273]

59 Notice also that the choice of the threshold corresponds to maintaining the correct proportion between positives and negatives. [sent-111, score-0.177]

60 This suggests another possible threshold selection strategy, based on the availability of enough labeled points to give a good estimate of the proportion of positive points in the dataset. [sent-112, score-0.143]

61 It is a measure of how naturally the data separates that this procedure is able to optimise the split with an accuracy of approximately 97. [sent-115, score-0.292]

62 29% by choosing the threshold that maximises the alignment (threshold number 435) but without making any use of the labels. [sent-116, score-0.758]

63 In Figure 2a we present the same results for the Gaussian kernel (σ = 6). [sent-117, score-0.302]

64 In this case the accuracy obtained by optimising the alignment (threshold number 316) of the resulting dichotomy is less impressive, being only about 79. [sent-118, score-0.863]

65 Here the accuracy of the split that optimises the alignment (threshold number 158) is approximately [sent-121, score-0.761]

66 Figure 2: Plot for Breast Cancer data (Gaussian kernel) (a) and Ionosphere data (linear kernel) (b) of λ_max/||K||_F (straight line), Â(S, k, y) for y = sign(v^max − θ_i) (bottom curve), and the accuracy of y (middle curve) against threshold number i. [sent-122, score-0.233]

67 We can also use the overall approach to adapt the kernel to the data. [sent-125, score-0.302]

68 For example, we can choose the kernel parameters so as to optimize λ_max/||K||_F. [sent-126, score-0.302]

69 Then find the first eigenvector, choose a threshold to maximise the alignment and output the corresponding y. [sent-127, score-0.813]
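One hedged way to realise this kernel adaptation (a sketch, not the paper's procedure; the Gaussian kernel family and the grid of candidate widths are assumptions) is a simple search over kernel parameters that scores each candidate matrix by lambda_max / ||K||_F:

```python
import numpy as np

def adapt_gaussian_kernel(X, sigmas):
    """Among candidate Gaussian widths, keep the kernel matrix whose
    lambda_max / ||K||_F is largest."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    best_K, best_score = None, -np.inf
    for sigma in sigmas:
        K = np.exp(-sq_dists / (2.0 * sigma ** 2))
        score = np.linalg.eigvalsh(K)[-1] / np.linalg.norm(K, 'fro')
        if score > best_score:
            best_score, best_K = score, K
    return best_K, best_score
```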

70 The cost to the alignment of changing a label y_i is 2 Σ_j y_j k(x_i, x_j)/||K||_F, so that if a point is isolated from the others, or if it is equally close to the two different classes, then changing its label will have only a very small effect. [sent-128, score-0.951]

71 On the other hand, labels in strongly clustered points clearly contribute to the overall cost and changing their label will alter the alignment significantly. [sent-129, score-1.006]
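The per-point sensitivity quoted above reduces to a single matrix-vector product (a sketch under the stated normalisation, not the authors' code):

```python
import numpy as np

def label_change_cost(K, y):
    """Cost to the alignment of flipping each label: 2 * sum_j y_j k(x_i, x_j) / ||K||_F."""
    y = np.asarray(y, dtype=float)
    return 2.0 * (K @ y) / np.linalg.norm(K, 'fro')
```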

72 We consider embedding the set into the real line, so as to satisfy a clustering criterion. [sent-133, score-0.114]

73 The resulting kernel matrix should appear as a block-diagonal matrix. [sent-134, score-0.137]

74 In those cases, the eigenvectors of the Laplacian have been used, and the approach is called the Fiedler ordering. [sent-136, score-0.135]

75 Although the Fiedler ordering could be used here as well, we present here a variation based on the simple kernel matrix. [sent-137, score-0.302]

76 It is maximized when points with high similarity have the same sign and high absolute value, and when points with different sign have low similarity. [sent-140, score-0.292]

77 The choice of coordinates v that optimizes this cost is the first eigenvector, and hence by sorting the data according to the value of their entry in this eigenvector one can hope to find a good permutation that renders the kernel matrix block diagonal. [sent-141, score-1.083]
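A minimal sketch of this ordering heuristic (an illustration, not the authors' code):

```python
import numpy as np

def block_diagonal_order(K):
    """Sort the points by their entry in the dominant eigenvector of K and
    permute the kernel matrix accordingly, so clusters appear as blocks."""
    _, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvecs[:, -1])       # entries of the first eigenvector
    return K[np.ix_(order, order)], order
```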

78 Figure 3 shows the results of this heuristic applied to the Breast cancer dataset. [sent-142, score-0.157]

79 The grey level indicates the size of the kernel entry. [sent-143, score-0.302]

80 ( ) "'"' Kij ="2 ~ Kij - Y' Ky ="2 Y, Ly, 1",", 1 We can express this quantity as ~ ydy; i,j where L is the Laplacian matrix, defined as L = D-K, where D = diag(d l , . [sent-149, score-0.104]

81 One would like to find y ∈ {-1, +1}^m so as to minimize the cut cost subject to the division being even, but this problem is NP-hard. [sent-153, score-0.326]

82 Following the same strategy as with the alignment, we can impose a slightly looser constraint on y: y ∈ R^m, Σ_i y_i^2 = m, Σ_i y_i = 0. [sent-154, score-0.587]

83 Since zero is an eigenvalue of L with eigenvector j, the all-ones vector, the problem is equivalent to finding the eigenvector of the smallest non-zero eigenvalue. [sent-156, score-0.727]

84 So the eigenvector corresponding to the eigenvalue. [sent-167, score-0.342]

85 One can now threshold the entries of the eigenvector in order to obtain a vector with -1 and +1 entries. [sent-172, score-0.487]
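A sketch of the cut-cost relaxation described in this passage (an illustration; it assumes K has non-negative entries so that the zero eigenvalue of L belongs to the all-ones vector, and it excludes the trivial one-sided splits, whose cut cost is zero):

```python
import numpy as np

def cluster_by_cut_cost(K):
    """Form L = D - K, take the eigenvector of the smallest non-zero eigenvalue,
    and threshold its entries to minimise the (unnormalised) cut cost (1/2) y'Ly."""
    L = np.diag(K.sum(axis=1)) - K
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending; eigvals[0] ~ 0 with eigenvector ~ j
    v = eigvecs[:, 1]                        # eigenvector of the smallest non-zero eigenvalue
    best_y, best_c = None, np.inf
    for theta in np.sort(v):
        y = np.where(v - theta >= 0, 1.0, -1.0)
        if 0 < (y > 0).sum() < len(y):       # skip trivial all-one-class splits
            c = 0.5 * float(y @ L @ y)
            if c < best_c:
                best_c, best_y = c, y
    return best_y, best_c
```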

86 We applied the procedure to the Breast cancer data with both linear and Gaussian kernels. [sent-174, score-0.194]

87 Now using the cut cost to select the best threshold for the linear kernel sets it at 378 with an accuracy of 67. [sent-176, score-0.828]

88 86%, significantly worse than the results obtained by optimising the alignment. [sent-177, score-0.164]

89 With the Gaussian kernel, on the other hand, the method selects threshold 312 with an accuracy of 80. [sent-178, score-0.233]

90 31%, a slight improvement over the results obtained with this kernel by optimising the alignment. [sent-179, score-0.497]

91 The idea that some labelled data might improve performance comes from observing Figure 4b, where the selection based on the cut-cost is clearly suboptimal. [sent-183, score-0.094]

92 By incorporating some label information, it is hoped that we can obtain an improved threshold selection. [sent-184, score-0.259]

93 0050 '---------c:c------c=--=--~-_=_-_=_-~ (a) (b) Figure 4: Plot for Breast Cancer data using (a) Linear kernel) and (b) Gaussian kernel of C(S,k,y) - ,X/(21IKIIF) (dashed curves), for y = sign(v maX - ()i) and the error of y (solid curve) against threshold number i. [sent-185, score-0.483]

94 Let z be the vector containing the known labels and 0 elsewhere. [sent-186, score-0.11]

95 We now use the original matrix K to generate the eigenvector, but the matrix K^P when measuring the cut-cost of the classifications generated by different thresholds. [sent-188, score-0.272]
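A heavily hedged sketch of this selection step (the construction of the label-informed matrix is not given in this extract, so it is passed in as K_P, and the eigenvector v is assumed to have been computed from the original K as in the sketches above):

```python
import numpy as np

def transductive_threshold(v, K_P):
    """Rank thresholds on the eigenvector v (from the original K) by the cut
    cost measured on the label-informed matrix K_P."""
    L_P = np.diag(K_P.sum(axis=1)) - K_P     # Laplacian of the modified matrix
    best_y, best_c = None, np.inf
    for theta in np.sort(v):
        y = np.where(v - theta >= 0, 1.0, -1.0)
        if 0 < (y > 0).sum() < len(y):       # ignore trivial splits
            c = 0.5 * float(y @ L_P @ y)     # cut-cost numerator on K_P
            if c < best_c:
                best_c, best_y = c, y
    return best_y
```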

96 67%) for the Breast cancer data with Gaussian kernel, a marked improvement over the 80. [sent-191, score-0.194]

97 4 Conclusions The paper has considered two partition costs: the first derived from the so-called alignment of a kernel to a label vector, and the second from the cut-cost of a label vector for a given kernel matrix. [sent-193, score-1.428]

98 The two quantities are both optimised by the same labelling, but give rise to different approximation algorithms when the discrete constraint is removed from the labelling vector. [sent-194, score-0.134]

99 It was shown how these relaxed problems are solved exactly using spectral techniques, hence leading to two distinct approximation algorithms through a post-processing phase that re-discretises the vector to create a labelling that is chosen to optimise the given criterion. [sent-195, score-0.422]

100 Experiments are presented showing the performance of both of these clustering techniques with some very striking results. [sent-196, score-0.114]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('alignment', 0.551), ('kernel', 0.302), ('eigenvector', 0.26), ('optimising', 0.164), ('cost', 0.16), ('cancer', 0.157), ('threshold', 0.144), ('eigenvectors', 0.135), ('cut', 0.133), ('breast', 0.133), ('clustering', 0.114), ('matrix', 0.107), ('labelling', 0.101), ('laplacian', 0.094), ('sign', 0.091), ('accuracy', 0.089), ('label', 0.085), ('max', 0.084), ('eigenvalue', 0.082), ('jaz', 0.08), ('labels', 0.075), ('optimise', 0.075), ('yy', 0.075), ('nello', 0.07), ('curve', 0.069), ('dirn', 0.067), ('fiedler', 0.067), ('xj', 0.066), ('plot', 0.064), ('quantity', 0.064), ('ly', 0.064), ('spectral', 0.062), ('partitioning', 0.061), ('cristianini', 0.061), ('dichotomy', 0.059), ('companion', 0.059), ('yi', 0.058), ('measuring', 0.058), ('xi', 0.058), ('labelled', 0.057), ('points', 0.055), ('definition', 0.052), ('ye', 0.05), ('relaxing', 0.05), ('maximise', 0.05), ('ionosphere', 0.05), ('kij', 0.05), ('straight', 0.049), ('entries', 0.048), ('unsupervised', 0.047), ('split', 0.047), ('sorting', 0.047), ('min', 0.046), ('bound', 0.045), ('clustered', 0.045), ('measure', 0.044), ('solved', 0.043), ('ones', 0.043), ('gram', 0.041), ('feature', 0.04), ('defined', 0.04), ('middle', 0.039), ('graph', 0.039), ('rm', 0.039), ('relaxed', 0.039), ('john', 0.038), ('optimizes', 0.038), ('data', 0.037), ('line', 0.036), ('yt', 0.036), ('dataset', 0.035), ('first', 0.035), ('vector', 0.035), ('changing', 0.035), ('datasets', 0.035), ('regarded', 0.035), ('inner', 0.034), ('choosing', 0.034), ('theoretic', 0.034), ('notice', 0.034), ('hence', 0.034), ('algorithms', 0.033), ('kernels', 0.033), ('closely', 0.033), ('classes', 0.033), ('partition', 0.033), ('proportion', 0.033), ('permutation', 0.033), ('find', 0.033), ('quality', 0.033), ('concentrated', 0.032), ('kl', 0.031), ('correlation', 0.031), ('slight', 0.031), ('bottom', 0.03), ('block', 0.03), ('incorporating', 0.03), ('ferent', 0.029), ('maximises', 0.029), ('normalising', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 170 nips-2001-Spectral Kernel Methods for Clustering

Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola

Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1

2 0.65167493 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

3 0.22625421 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.

4 0.19826205 171 nips-2001-Spectral Relaxation for K-means Clustering

Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon

Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descend method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vectors can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1

5 0.19749486 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

6 0.194032 136 nips-2001-On the Concentration of Spectral Properties

7 0.18333249 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

8 0.15552862 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

9 0.1424897 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

10 0.14075121 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

11 0.13282965 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

12 0.12148366 74 nips-2001-Face Recognition Using Kernel Methods

13 0.11755806 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

14 0.10318471 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

15 0.10268387 89 nips-2001-Grouping with Bias

16 0.092787504 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

17 0.090078749 60 nips-2001-Discriminative Direction for Kernel Classifiers

18 0.086552076 185 nips-2001-The Method of Quantum Clustering

19 0.084899433 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates

20 0.082765169 56 nips-2001-Convolution Kernels for Natural Language


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.299), (1, 0.214), (2, -0.074), (3, -0.338), (4, 0.309), (5, 0.198), (6, -0.031), (7, 0.068), (8, -0.095), (9, 0.224), (10, -0.074), (11, 0.032), (12, -0.06), (13, 0.117), (14, -0.244), (15, -0.046), (16, 0.004), (17, -0.081), (18, 0.033), (19, 0.02), (20, 0.004), (21, 0.086), (22, -0.216), (23, 0.067), (24, 0.053), (25, -0.039), (26, 0.029), (27, -0.04), (28, 0.117), (29, 0.099), (30, 0.242), (31, -0.069), (32, -0.053), (33, 0.101), (34, 0.068), (35, -0.096), (36, 0.024), (37, 0.037), (38, -0.042), (39, 0.052), (40, 0.011), (41, -0.143), (42, 0.032), (43, -0.031), (44, -0.053), (45, 0.007), (46, 0.09), (47, -0.029), (48, 0.043), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95843822 170 nips-2001-Spectral Kernel Methods for Clustering

Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola

Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1

2 0.9233073 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

3 0.55577487 164 nips-2001-Sampling Techniques for Kernel Methods

Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf

Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function by a “randomized kernel” which behaves like in expectation.

4 0.49437886 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

5 0.47076884 136 nips-2001-On the Concentration of Spectral Properties

Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola

Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1

6 0.43989125 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

7 0.4378112 171 nips-2001-Spectral Relaxation for K-means Clustering

8 0.42051864 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

9 0.39645639 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

10 0.35516137 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists

11 0.33646789 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

12 0.32540801 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

13 0.32085896 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

14 0.31717917 74 nips-2001-Face Recognition Using Kernel Methods

15 0.31717896 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

16 0.29766664 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

17 0.29713747 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates

18 0.29223081 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

19 0.28651217 185 nips-2001-The Method of Quantum Clustering

20 0.27654344 56 nips-2001-Convolution Kernels for Natural Language


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.412), (17, 0.073), (19, 0.019), (20, 0.017), (27, 0.118), (30, 0.05), (59, 0.036), (72, 0.065), (79, 0.024), (91, 0.1)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96501648 49 nips-2001-Citcuits for VLSI Implementation of Temporally Asymmetric Hebbian Learning

Author: A. Bofill, D. P. Thompson, Alan F. Murray

Abstract: Experimental data has shown that synaptic strength modification in some types of biological neurons depends upon precise spike timing differences between presynaptic and postsynaptic spikes. Several temporally-asymmetric Hebbian learning rules motivated by this data have been proposed. We argue that such learning rules are suitable to analog VLSI implementation. We describe an easily tunable circuit to modify the weight of a silicon spiking neuron according to those learning rules. Test results from the fabrication of the circuit using a O.6J.lm CMOS process are given. 1

same-paper 2 0.91737896 170 nips-2001-Spectral Kernel Methods for Clustering

Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola

Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1

3 0.90634537 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.     ¡

4 0.83202088 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

5 0.82194555 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

Author: Manfred Opper, Robert Urbanczik

Abstract: Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1

6 0.65099138 137 nips-2001-On the Convergence of Leveraging

7 0.62570554 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

8 0.61841273 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

9 0.61099499 58 nips-2001-Covariance Kernels from Bayesian Generative Models

10 0.6089977 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

11 0.60791498 112 nips-2001-Learning Spike-Based Correlations and Conditional Probabilities in Silicon

12 0.59880006 8 nips-2001-A General Greedy Approximation Algorithm with Applications

13 0.59440082 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

14 0.59103632 74 nips-2001-Face Recognition Using Kernel Methods

15 0.58828479 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

16 0.58784699 22 nips-2001-A kernel method for multi-labelled classification

17 0.57952034 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

18 0.57750285 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

19 0.57411319 166 nips-2001-Self-regulation Mechanism of Temporally Asymmetric Hebbian Plasticity

20 0.57364136 31 nips-2001-Algorithmic Luckiness