nips nips2001 nips2001-164 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function k by a “randomized kernel” which behaves like k in expectation.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. [sent-5, score-2.376]
2 Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function k by a “randomized kernel” which behaves like k in expectation. [sent-7, score-0.525]
3 1 Introduction Given a collection of training data X, techniques such as linear SVMs [13] and PCA extract features from X by computing linear functions of this data. [sent-8, score-0.212]
4 Worse, many data sets do not readily support linear operations such as addition and scalar multiplication (text, for example). [sent-10, score-0.175]
5 In a “kernel method” the data is first mapped into some dot product space H using a map Φ. [sent-11, score-0.107]
6 Nonetheless, in many cases the dot products can be evaluated efficiently using a positive definite kernel k, i.e. k(x, y) = ⟨Φ(x), Φ(y)⟩. [sent-13, score-0.495]
7 Note that at no point is the map Φ explicitly computed; the kernel implicitly performs the dot product calculations between mapped points. [sent-18, score-0.466]
8 While this “kernel trick” has been extremely successful, a problem common to all kernel methods is that, in general, the Gram matrix K is dense, making the input size scale as n². [sent-19, score-0.472]
9 For example, in Kernel PCA such a matrix has to be diagonalized, while in SVMs a quadratic program of size n² must be solved. [sent-20, score-0.172]
10 As the size of training sets in practical applications increases, the growth of the input size rapidly poses severe computational limitations. [sent-21, score-0.126]
11 Various methods have been proposed to deal with this issue, such as decomposition methods for SVM training (e.g., [10]), speedup methods for Kernel PCA [12], and other kernel methods [2, 14]. [sent-24, score-0.359]
12 Our research is motivated by the need for such speedups that are also accompanied by strong, provable performance guarantees. [sent-25, score-0.1]
13 We start by simplifying the Gram matrix via a novel matrix sampling/quantization scheme, motivated by spectral properties of random matrices. [sent-27, score-0.507]
14 We then move on to speeding up classification, by using randomized rounding in evaluating kernel expansions. [sent-28, score-1.047]
15 Finally, we consider the evaluation of kernel functions themselves and show how many popular kernels can be approximated efficiently. [sent-29, score-0.502]
16 Our first technique relates matrix simplification to the stability of invariant subspaces. [sent-30, score-0.23]
17 The other two are, in fact, completely general and apply to all kernel methods. [sent-31, score-0.359]
18 What is more, our techniques suggest the notion of randomized kernels, whereby each evaluation of the kernel is replaced by an evaluation of a randomized function (on the same input pair). [sent-32, score-1.173]
19 The idea is to use a function k' which for every input pair behaves like k in expectation (over its internal coin-flips), yet confers significant computational benefits compared to using k. [sent-33, score-0.104]
20 In fact, each one of our three techniques can be readily cast as an appropriate randomized kernel, with no other intervention. [sent-34, score-0.453]
21 2 Kernel PCA Given training points x_1, ..., x_n, recall that K is the n × n Gram matrix with K_ij = k(x_i, x_j). [sent-35, score-0.289]
22 The choice of method depends on the properties of the matrix and on how many components one is seeking. [sent-38, score-0.213]
23 It is worth looking closely at the complexity of performing Orthogonal Iteration on a matrix K. [sent-50, score-0.209]
24 The number of iterations of the while loop is a somewhat complicated issue, but one can prove that the “error” in the current iterate (with respect to the true principal components) decreases exponentially with the number of iterations. [sent-53, score-0.085]
25 All in all, the running time of Orthogonal Iteration scales linearly with the cost of the matrix multiplication K V. [sent-54, score-0.258]
26 E.g., if roughly one out of every s entries of K is non-zero, then the matrix multiplication costs roughly a 1/s fraction of the dense cost. [sent-57, score-0.377]
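To make the algorithm concrete, here is a minimal numpy sketch of Orthogonal Iteration as described above; the function name and parameters are our own, not the paper's. The key point is that each pass costs one multiplication K @ V plus a re-orthonormalization, so anything that makes multiplication by K cheaper speeds up the whole computation.

import numpy as np

def orthogonal_iteration(K, m, n_iter=100, seed=0):
    # Approximate the top-m eigenvectors of a symmetric matrix K.
    # Each pass costs one multiplication K @ V plus a QR re-orthonormalization,
    # so the running time is dominated by how cheaply K @ V can be formed.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    V, _ = np.linalg.qr(rng.standard_normal((n, m)))   # random orthonormal start
    for _ in range(n_iter):
        V, _ = np.linalg.qr(K @ V)                      # multiply, re-orthonormalize
    return V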
27 As mentioned earlier, the matrix K used in Kernel PCA is almost never sparse. [sent-62, score-0.172]
28 In the next section, we will show how to sample and quantize the entries of K, obtaining a matrix K̂ which is sparser and whose entries have simpler data representation, yet has essentially the same spectral structure, i.e., approximately the same eigenvalues and eigenvectors. [sent-63, score-0.541]
29 3 Sampling Gram Matrices In this section we describe two general “matrix simplification” techniques and discuss their implications for Kernel PCA. [sent-68, score-0.071]
30 In particular, under natural assumptions on the spectral structure of K, we will prove that applying KPCA to the simplified matrix K̂ yields subspaces which are very close to those that KPCA would find in K. [sent-69, score-0.321]
31 First, our sparsification process works by randomly omitting entries of K. [sent-71, score-0.204]
32 Precisely stated, we let the matrix K̂ be described entrywise as K̂_ij = K_ij / p with probability p, and K̂_ij = 0 with probability 1 - p. [sent-72, score-0.162]
33 Second, our quantization process rounds each entry of K to one of {+b, -b}, thus reducing the representation of each entry to a single bit: K̂_ij = +b with probability 1/2 + K_ij/(2b) and K̂_ij = -b with probability 1/2 - K_ij/(2b), where b bounds the magnitude of the entries of K. [sent-73, score-0.476]
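A small numpy sketch of the two entrywise schemes just described, given only as an illustration: the function names are ours, and mirroring the upper triangle is our own choice for keeping the simplified matrix symmetric, which the entrywise description leaves implicit.

import numpy as np

def sparsify(K, p, seed=0):
    # Keep each entry (rescaled by 1/p) with probability p, zero it otherwise;
    # sampling the upper triangle and mirroring keeps K_hat symmetric.
    rng = np.random.default_rng(seed)
    keep = rng.random(K.shape) < p
    keep = np.triu(keep) | np.triu(keep, 1).T
    return np.where(keep, K / p, 0.0)        # unbiased: E[K_hat] = K

def quantize(K, seed=0):
    # Round each entry to +b or -b, with probabilities chosen so E[K_hat] = K.
    rng = np.random.default_rng(seed)
    b = np.abs(K).max()
    prob_plus = 0.5 + K / (2.0 * b)          # lies in [0, 1] since |K_ij| <= b
    U = rng.random(K.shape)
    U = np.triu(U) + np.triu(U, 1).T         # one coin per unordered pair (i, j)
    return np.where(U < prob_plus, b, -b)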
34 Sparsification greatly accelerates the computation of eigenvectors by accelerating multiplication by K̂. [sent-74, score-0.211]
35 Moreover, both approaches greatly reduce the space required to store the matrix (and they can be readily combined), allowing for much bigger training sets to fit in main memory. [sent-75, score-0.33]
36 Finally, we note that i) sampling also speeds up the construction of the Gram matrix, since we need only compute those values of k(x_i, x_j) that remain in K̂, while ii) quantization allows us to replace exact kernel evaluations by coarse unbiased estimators, which can be more efficient to compute. [sent-76, score-1.034]
37 Moreover, the entries of the error matrix E = K̂ - K are independent random variables, having expectation zero and bounded variance. [sent-78, score-0.227]
38 Large deviation extensions [5] of Wigner's famous semi-circle law imply that with very high probability such matrices have small L2 norm (denoted by ‖·‖₂ throughout). [sent-79, score-0.158]
39 Theorem 1 (Furedi and Komlos [5]) Let E be an n × n symmetric matrix whose entries are independent random variables with mean 0, variance bounded above by σ², and bounded magnitude. Then, with probability 1 - o(1), ‖E‖₂ ≤ 4σ√n. [sent-80, score-0.547]
40 It is worth noting that this upper bound is within a constant factor of the lower bound on the L2 norm of any matrix where the mean squared entry equals σ². [sent-81, score-0.470]
41 More precisely, it is easy to show that every matrix with Frobenius norm σn has L2 norm at least σ√n. [sent-82, score-0.328]
42 Therefore, we see that the L2 error introduced by K̂ is within a factor of 4 of the L2 error associated with any modification to K that has the same entrywise mean squared error. [sent-83, score-0.071]
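As a quick empirical check of this picture (our own toy example, not from the paper): the spectral norm of a zero-mean symmetric matrix with independent ±σ entries grows only like σ√n, comfortably under the 4σ√n bound, while its Frobenius norm grows like σn.

import numpy as np

rng = np.random.default_rng(1)
n, sigma = 1000, 1.0
A = rng.choice([-sigma, sigma], size=(n, n))
E = np.triu(A) + np.triu(A, 1).T                     # symmetric, zero-mean, entries +/- sigma
print("||E||_2         :", np.linalg.norm(E, 2))     # roughly 2 * sigma * sqrt(n)
print("4*sigma*sqrt(n) :", 4 * sigma * np.sqrt(n))   # upper bound discussed above
print("||E||_F         :", np.linalg.norm(E, 'fro')) # roughly sigma * n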
43 At the heart of these results is the stability of invariant subspaces in the presence of additive noise. [sent-85, score-0.105]
44 In stating each of these results, it is important to note that the eigenvectors correspond exactly to the feature extractors associated with Kernel PCA. [sent-87, score-0.259]
45 Observe that in this case we cannot hope to equate all the individual features of K and K̂, as the i-th feature is very sensitive to small changes in K. [sent-90, score-0.085]
46 However, we can show that all features with eigenvalue far from the threshold τ are treated consistently in K and K̂. [sent-91, score-0.072]
47 Theorem 2 Let U be any matrix whose columns form an orthonormal basis for the space of features (eigenvectors) of K whose eigenvalue is at least τ. [sent-92, score-0.547]
48 Let V be any matrix whose columns form an orthonormal basis for the orthogonal complement of U. [sent-93, score-0.447]
49 Similarly, the second equation asserts that KPCA will recover all the features of K whose eigenvalues are larger than τ. [sent-96, score-0.319]
50 However, the simplified matrix can reorder their importance by, say, interchanging two features; we can show that any such interchange will occur consistently in all test vectors. [sent-102, score-0.172]
51 Let f(x) be the vector whose i-th coordinate is the i-th feature value of x, i. [sent-103, score-0.201]
52 e., here we do not normalize features to “equal importance”. [sent-105, score-0.072]
53 Recall that f̂(x) is the analogous vector whose i-th coordinate is the i-th feature value computed from K̂. [sent-106, score-0.201]
54 Theorem 3 Assume that the features under consideration have eigenvalues bounded away from the rest of the spectrum of K. Then there is an orthonormal rotation matrix R such that for all test points x, f̂(x) is close to R f(x). [sent-108, score-0.09]
55 Proof: Instantiate Theorem 2 with a suitable threshold τ. Note that the rotation matrix becomes completely irrelevant if we are only concerned with differences, angles, or inner products of feature vectors. [sent-111, score-0.845]
56 Finally, we prove that in the special case where a feature is well separated from its neighboring features in the spectrum of K, we get a particularly strong bound. [sent-112, score-0.207]
57 We will devise a fast, unbiased, small-variance estimator for the expansion f(x) = Σ_i α_i k(x_i, x), by sampling and rounding the expansion coefficients α_i. [sent-117, score-0.395]
58 If |α_i| ≥ t then let α̂_i = α_i; otherwise let α̂_i = t · sign(α_i) with probability |α_i| / t, and α̂_i = 0 with probability 1 - |α_i| / t. [sent-118, score-0.114]
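A minimal sketch of this rounding scheme as reconstructed above (the threshold t, the coefficient array alpha and the function names are our own notation): coefficients of magnitude at least t are kept exactly, smaller ones are randomly promoted to ±t or dropped, giving an unbiased estimate of the expansion that needs far fewer kernel evaluations.

import numpy as np

def round_coefficients(alpha, t, seed=0):
    # Keep large coefficients exactly; promote small ones to +/- t with
    # probability |alpha_i| / t, otherwise drop them.  E[alpha_hat] = alpha.
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    promote = rng.random(alpha.shape) < np.abs(alpha) / t
    return np.where(np.abs(alpha) >= t, alpha,
                    np.where(promote, t * np.sign(alpha), 0.0))

def approx_expansion(alpha_hat, kernel, support, x):
    # Evaluate f(x) ~ sum_i alpha_hat_i * k(x_i, x), skipping zeroed coefficients,
    # so only the surviving coefficients trigger kernel evaluations.
    idx = np.flatnonzero(alpha_hat)
    return sum(alpha_hat[i] * kernel(support[i], x) for i in idx)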
59 This makes ‖α‖₁, the L1 norm of the coefficient vector, the natural scale for measuring the coefficients, and suggests that using far fewer than n kernel evaluations we can get good approximations of f(x). [sent-127, score-0.618]
60 Theorem 5 For any suitable choice of the threshold t, with high probability: 1. there are fewer than roughly ‖α‖₁ / t non-zero coefficients α̂_i, and 2. the resulting estimate is close to f(x). Proof: Let S denote the number of non-zero coefficients α̂_i and let μ denote its expectation. [sent-133, score-0.162]
61 It is not hard to show that the probability that the event in 1 fails is bounded by the corresponding probability for the case where all coordinates of α are equal. [sent-135, score-0.268]
62 In that case, S is a Binomial random variable with n trials and probability of success ‖α‖₁ / (n t) and, by our choice of t, its expectation is ‖α‖₁ / t. [sent-136, score-0.154]
63 The Chernoff bound now implies that the event in 1 fails to occur with only exponentially small probability. [sent-137, score-0.102]
64 For the event in 2 it suffices to observe that failure occurs only if the estimator deviates from its expectation by more than the allowed amount. [sent-138, score-0.093]
65 5 Quick batch approximations of Kernels In this section we devise fast approximations of the kernel function itself. [sent-140, score-0.607]
66 We focus on kernels sharing the following two characteristics: i) their inputs live in d-dimensional Euclidean space, and ii) the kernel value depends only on the distance and/or inner product of the considered points. [sent-141, score-0.177]
67 A natural approach for creating such an oracle is to pick a subset of the coordinates in input space and use the projection onto these coordinates to determine distances and inner products. [sent-147, score-0.526]
68 The problem with this approach is that if the relevant distances and inner products are concentrated on just a few coordinates, any coordinate sampling scheme is bound to do poorly. [sent-148, score-0.307]
69 On the other hand, if we knew that all coordinates contributed “approximately equally” to these quantities, then coordinate sampling would be much more appealing. [sent-149, score-0.370]
70 We will do just this, using the technique of random projections [8], which can be viewed as coordinate sampling preceded by a random rotation. [sent-150, score-0.435]
71 Imagine that, before training, we applied a spherically random rotation to the training points and then applied the same random rotation to each input point as it became available. [sent-151, score-0.562]
72 Clearly, all distances and inner products would remain the same and we would get exactly the same results as without the rotation. [sent-152, score-0.266]
73 The interesting part is that any fixed vector that was a linear combination of training and/or input vectors, e. [sent-153, score-0.126]
74 g. a difference of two points, after being rotated becomes a spherically random vector of the same length. [sent-155, score-0.152]
75 Our oracle amounts to multiplying each training and input point by the same k × d projection matrix R, where k ≪ d, and using the resulting k-dimensional points to estimate distances and inner products. [sent-162, score-0.864]
76 Before describing the choice of R and the quality of the resulting approximations, let us go over the computational savings. [sent-164, score-0.101]
77 This rotation can be performed in the training phase. [sent-169, score-0.216]
78 The kernel evaluations for each input point now take time proportional to k instead of d. [sent-171, score-0.508]
79 Having motivated our oracle as a spherically random rotation followed by coordinate sampling, we will actually employ a simpler method to perform the projection. [sent-174, score-0.542]
80 Namely, we will rely on a recent result of [1], asserting that we can do at least as well by taking the entries of the projection matrix to be i.i.d. uniformly random ±1 values. [sent-175, score-0.110]
81 Thus, postponing the scaling by 1/√k until the end, each of the k new coordinates is formed as follows: split the d coordinates randomly into two groups; sum the coordinates in each group; take the difference of the two sums. [sent-179, score-0.333]
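As a sketch of this projection step (assuming, as in the ±1 construction just described, i.i.d. random signs and a deferred 1/√k scaling; the function name and the toy data are ours):

import numpy as np

def project(X, k, seed=0):
    # Multiply the points (rows of X) by a k x d matrix of i.i.d. +/-1 entries,
    # deferring the 1/sqrt(k) scaling to the end as in the text.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(k, d))
    return (X @ R.T) / np.sqrt(k)

# Projected lengths and distances approximate the original ones:
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 10000))
Y = project(X, k=500)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))  # close
print(np.linalg.norm(X[0]), np.linalg.norm(Y[0]))                # close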
82 Let R be a k × d random matrix with entries R_ij ∈ {+1, -1}, each case having probability 1/2, and for every pair of points let the rescaled projections stand in for the original distances and inner products. [sent-186, score-0.162]
83 By our choice of k, each individual length is preserved up to a small multiplicative factor with high probability. [sent-187, score-0.326]
84 Thus, by the union bound, with high probability the lengths of all vectors corresponding to differences of training and input points are maintained within a small multiplicative factor. [sent-191, score-0.102]
85 6 Conclusion We have described three techniques for speeding up kernel methods through the use of randomization. [sent-194, score-0.571]
86 While the discussion has focused on Kernel PCA, we feel that our techniques have potential for further development and empirical evaluation in a more general setting. [sent-195, score-0.168]
87 Indeed, the methods for sampling kernel expansions and for speeding up the kernel evaluation are universal; also, the Gram matrix sampling is readily applicable to any kernel technique based on the eigendecomposition of the Gram matrix [3]. [sent-196, score-2.077]
88 Furthermore, it might enable us to speed up SVM training by sparsifying the Hessian and then applying a sparse QP solver, such as the ones described in [6, 9]. [sent-197, score-0.109]
89 Our sampling and quantization techniques, both in training and classification, amount to repeatedly replacing single kernel evaluations with independent random variables that have appropriate expectations. [sent-198, score-0.919]
90 For sparsification, the randomized kernel evaluates to k(x, y)/p with probability p and to 0 with probability 1 - p, while an analogous randomized kernel is the obvious choice for quantization. [sent-200, score-0.747]
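To make the “randomized kernel” view concrete, here is a small illustrative wrapper for the sparsification case (our own code, not from the paper): each call returns k(x, y)/p with probability p and 0 otherwise, so it equals the original kernel in expectation while skipping most of the expensive evaluations.

import numpy as np

def randomized_kernel(kernel, p, seed=0):
    # Wrap a kernel so each call returns an independent unbiased estimate:
    # kernel(x, y) / p with probability p, and 0 otherwise.
    rng = np.random.default_rng(seed)

    def k_prime(x, y):
        if rng.random() < p:
            return kernel(x, y) / p   # the expensive evaluation happens here
        return 0.0                    # the evaluation is skipped entirely
    return k_prime

# Example: an RBF kernel that is actually evaluated only about 10% of the time.
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
k_prime = randomized_kernel(rbf, p=0.1)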
91 We feel that this approach suggests a notion of randomized kernels, wherein kernel evaluations are no longer considered as deterministic but inherently random, providing unbiased estimators for the corresponding inner products. [sent-201, score-1.04]
92 Given bounds on the variance of these estimators, it seems that algorithms which reduce to computing weighted sums of kernel evaluations can exploit concentration of measure. [sent-202, score-0.508]
93 Thus, randomized kernels appear promising as a general tool for speeding up kernel methods, warranting further investigation. [sent-203, score-0.886]
94 BS would like to thank Santosh Venkatesh for detailed discussions on sampling kernel expansions. [sent-205, score-0.518]
95 Davis, C. and Kahan, W. M., The rotation of eigenvectors by a perturbation III, SIAM Journal on Numerical Analysis 7 (1970), 1–46. [sent-223, score-0.272]
96 Füredi, Z. and Komlós, J., The eigenvalues of random symmetric matrices, Combinatorica 1 (1981), no. 3, 233–241. [sent-226, score-0.219]
97 Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58 (1963), 13–30. [sent-234, score-0.108]
98 Schölkopf, B., Smola, A. and Müller, K.-R., Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998), 1299–1319. [sent-257, score-0.359]
99 Smola, A. and Schölkopf, B., Sparse greedy matrix approximation for machine learning, Proc. 17th International Conference on Machine Learning, 2000. [sent-261, score-0.172]
100 Williams, C. K. I. and Seeger, M., Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems 13, MIT Press, 2001. [sent-270, score-0.359]
wordName wordTfidf (topN-words)
[('kernel', 0.359), ('randomized', 0.293), ('kpca', 0.217), ('rounding', 0.186), ('matrix', 0.172), ('sampling', 0.159), ('gram', 0.151), ('evaluations', 0.149), ('rotation', 0.147), ('speeding', 0.141), ('eigenvalues', 0.126), ('eigenvectors', 0.125), ('entries', 0.119), ('pca', 0.118), ('coordinates', 0.111), ('oracle', 0.105), ('coordinate', 0.1), ('kernels', 0.093), ('spherically', 0.093), ('quantization', 0.09), ('orthonormal', 0.09), ('readily', 0.089), ('multiplication', 0.086), ('orthogonal', 0.085), ('extractors', 0.085), ('inner', 0.084), ('products', 0.074), ('features', 0.072), ('entrywise', 0.071), ('invocation', 0.071), ('kahan', 0.071), ('sparsi', 0.071), ('techniques', 0.071), ('cients', 0.07), ('training', 0.069), ('evaluating', 0.068), ('theorem', 0.067), ('spectral', 0.066), ('whose', 0.065), ('coef', 0.065), ('proof', 0.065), ('asserting', 0.062), ('speedups', 0.062), ('achlioptas', 0.062), ('dot', 0.062), ('let', 0.06), ('approximations', 0.06), ('random', 0.059), ('simpli', 0.059), ('cation', 0.059), ('iteration', 0.059), ('projections', 0.058), ('expansions', 0.058), ('distances', 0.058), ('stability', 0.058), ('input', 0.057), ('entry', 0.057), ('unbiased', 0.057), ('asserts', 0.056), ('dense', 0.056), ('mcsherry', 0.056), ('probability', 0.054), ('norm', 0.054), ('davis', 0.053), ('hoeffding', 0.053), ('lkopf', 0.052), ('estimators', 0.051), ('get', 0.05), ('evaluation', 0.05), ('matrices', 0.05), ('sch', 0.05), ('rotating', 0.05), ('devise', 0.05), ('bounded', 0.049), ('principal', 0.049), ('feature', 0.049), ('replace', 0.048), ('recall', 0.048), ('bound', 0.048), ('least', 0.048), ('kaufman', 0.047), ('feel', 0.047), ('behaves', 0.047), ('subspaces', 0.047), ('mapped', 0.045), ('observe', 0.045), ('bs', 0.042), ('choice', 0.041), ('sparse', 0.04), ('fast', 0.039), ('batch', 0.039), ('motivated', 0.038), ('svm', 0.037), ('worth', 0.037), ('th', 0.036), ('prove', 0.036), ('columns', 0.035), ('american', 0.035), ('symmetric', 0.034), ('replacing', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function k by a “randomized kernel” which behaves like k in expectation.
2 0.22625421 170 nips-2001-Spectral Kernel Methods for Clustering
Author: Nello Cristianini, John Shawe-Taylor, Jaz S. Kandola
Abstract: In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results. 1
3 0.21365611 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
4 0.2039353 74 nips-2001-Face Recognition Using Kernel Methods
Author: Ming-Hsuan Yang
Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction
5 0.20077184 134 nips-2001-On Kernel-Target Alignment
Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola
Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1
6 0.1976894 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
7 0.18982898 136 nips-2001-On the Concentration of Spectral Properties
8 0.17031106 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
9 0.16280083 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
10 0.16137846 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
11 0.15462719 139 nips-2001-Online Learning with Kernels
12 0.1483527 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
13 0.13909672 155 nips-2001-Quantizing Density Estimators
14 0.13777491 171 nips-2001-Spectral Relaxation for K-means Clustering
15 0.13419047 105 nips-2001-Kernel Machines and Boolean Functions
16 0.13319279 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
17 0.13220927 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
18 0.1300185 56 nips-2001-Convolution Kernels for Natural Language
19 0.11273747 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
20 0.11245688 60 nips-2001-Discriminative Direction for Kernel Classifiers
topicId topicWeight
[(0, -0.354), (1, 0.204), (2, -0.058), (3, -0.17), (4, 0.195), (5, 0.149), (6, 0.019), (7, 0.098), (8, 0.009), (9, 0.091), (10, -0.166), (11, 0.016), (12, 0.017), (13, -0.048), (14, -0.136), (15, -0.016), (16, -0.007), (17, 0.024), (18, -0.014), (19, 0.018), (20, 0.087), (21, -0.032), (22, 0.11), (23, -0.022), (24, -0.098), (25, 0.077), (26, -0.017), (27, 0.056), (28, -0.006), (29, -0.097), (30, -0.053), (31, 0.024), (32, 0.044), (33, 0.027), (34, -0.027), (35, 0.068), (36, -0.059), (37, -0.01), (38, 0.042), (39, -0.012), (40, 0.007), (41, -0.077), (42, -0.025), (43, 0.002), (44, 0.09), (45, -0.024), (46, -0.027), (47, -0.06), (48, -0.032), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.97635496 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function k by a “randomized kernel” which behaves like k in expectation.
2 0.79135191 74 nips-2001-Face Recognition Using Kernel Methods
Author: Ming-Hsuan Yang
Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction
3 0.70929801 155 nips-2001-Quantizing Density Estimators
Author: Peter Meinicke, Helge Ritter
Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.
4 0.68947637 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
5 0.68529963 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called
6 0.681422 170 nips-2001-Spectral Kernel Methods for Clustering
7 0.66489184 136 nips-2001-On the Concentration of Spectral Properties
8 0.66460496 134 nips-2001-On Kernel-Target Alignment
9 0.63338113 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
10 0.62648576 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
11 0.61670381 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
12 0.60864598 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
13 0.57900846 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
14 0.56136566 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists
15 0.54045218 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
16 0.53933555 139 nips-2001-Online Learning with Kernels
17 0.52415848 105 nips-2001-Kernel Machines and Boolean Functions
18 0.51963991 56 nips-2001-Convolution Kernels for Natural Language
19 0.50978118 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
20 0.49223191 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
topicId topicWeight
[(14, 0.061), (17, 0.043), (19, 0.018), (27, 0.128), (30, 0.068), (38, 0.014), (59, 0.39), (72, 0.059), (79, 0.028), (83, 0.017), (91, 0.112)]
simIndex simValue paperId paperTitle
1 0.91374069 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method
Author: K. Yao, S. Nakamura
Abstract: We present a sequential Monte Carlo method applied to additive noise compensation for robust speech recognition in time-varying noise. The method generates a set of samples according to the prior distribution given by clean speech models and noise prior evolved from previous estimation. An explicit model representing noise effects on speech features is used, so that an extended Kalman filter is constructed for each sample, generating the updated continuous state estimate as the estimation of the noise parameter, and prediction likelihood for weighting each sample. Minimum mean square error (MMSE) inference of the time-varying noise parameter is carried out over these samples by fusion the estimation of samples according to their weights. A residual resampling selection step and a Metropolis-Hastings smoothing step are used to improve calculation efficiency. Experiments were conducted on speech recognition in simulated non-stationary noises, where noise power changed artificially, and highly non-stationary Machinegun noise. In all the experiments carried out, we observed that the method can have significant recognition performance improvement, over that achieved by noise compensation with stationary noise assumption. 1
same-paper 2 0.89797866 164 nips-2001-Sampling Techniques for Kernel Methods
Author: Dimitris Achlioptas, Frank Mcsherry, Bernhard Schölkopf
Abstract: We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained approximations. Rather intriguingly, all three techniques can be viewed as instantiations of the following idea: replace the kernel function k by a “randomized kernel” which behaves like k in expectation.
3 0.89592624 108 nips-2001-Learning Body Pose via Specialized Maps
Author: Rómer Rosales, Stan Sclaroff
Abstract: A nonlinear supervised learning model, the Specialized Mappings Architecture (SMA), is described and applied to the estimation of human body pose from monocular images. The SMA consists of several specialized forward mapping functions and an inverse mapping function. Each specialized function maps certain domains of the input space (image features) onto the output space (body pose parameters). The key algorithmic problems faced are those of learning the specialized domains and mapping functions in an optimal way, as well as performing inference given inputs and knowledge of the inverse function. Solutions to these problems employ the EM algorithm and alternating choices of conditional independence assumptions. Performance of the approach is evaluated with synthetic and real video sequences of human motion. 1
4 0.86485964 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
Author: Lehel Csató, Manfred Opper, Ole Winther
Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbitrary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.
5 0.63768727 74 nips-2001-Face Recognition Using Kernel Methods
Author: Ming-Hsuan Yang
Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction
6 0.61811304 154 nips-2001-Products of Gaussians
7 0.60980529 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
8 0.60510933 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
9 0.59518754 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
10 0.59489959 71 nips-2001-Estimating the Reliability of ICA Projections
11 0.59230357 155 nips-2001-Quantizing Density Estimators
12 0.58406818 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
13 0.58009422 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
14 0.57878411 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
15 0.57731026 179 nips-2001-Tempo tracking and rhythm quantization by sequential Monte Carlo
16 0.57455277 84 nips-2001-Global Coordination of Local Linear Models
17 0.57284784 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
18 0.57004946 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
19 0.56963158 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
20 0.56944591 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes