nips nips2000 nips2000-121 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. [sent-5, score-1.228]
2 Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. [sent-6, score-0.175]
3 This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. [sent-7, score-1.073]
4 1 Introduction Principal component analysis (PCA) is a well-established technique for dimensionality reduction, and examples of its many applications include data compression, image processing, visualisation, exploratory data analysis, pattern recognition and time series prediction. [sent-8, score-0.314]
5 Given a set of N d-dimensional data vectors $x_n$, which we take to have zero mean, the principal components are the linear projections onto the 'principal axes', defined as the leading eigenvectors of the sample covariance matrix $S = N^{-1}\sum_{n=1}^{N} x_n x_n^\top = N^{-1} X^\top X$, where $X = (x_1, x_2, \ldots, x_N)^\top$. [sent-9, score-0.982]
6 These projections are of interest as they retain maximum variance and minimise error of subsequent linear reconstruction. [sent-13, score-0.234]
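A minimal sketch (not from the paper) of the linear PCA just described, assuming zero-mean data and NumPy; the function name `linear_pca` is illustrative:

```python
import numpy as np

def linear_pca(X, n_components):
    """Minimal linear PCA for zero-mean data X of shape (N, d): the principal
    axes are the leading eigenvectors of S = N^{-1} X^T X, and the principal
    components are the projections of the data onto those axes."""
    S = X.T @ X / X.shape[0]              # sample covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    axes = eigvecs[:, order]              # principal axes (d x n_components)
    return X @ axes                       # projections (N x n_components)
```

These projections are exactly the quantities whose nonlinear, feature-space analogue kernel PCA computes.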
7 However, because PCA only defines a linear projection of the data, the scope of its application is necessarily somewhat limited. [sent-14, score-0.174]
8 This has naturally motivated various developments of nonlinear 'principal component analysis' in an effort to model non-trivial data structures more faithfully, and a particularly interesting recent innovation has been 'kernel PCA' [4]. [sent-15, score-0.301]
9 Kernel PCA, summarised in Section 2, makes use of the 'kernel trick', so effectively exploited by the 'support vector machine', in that a kernel function k(·,·) may be considered to represent a dot (inner) product in some transformed space if it satisfies Mercer's condition - i.e. [sent-16, score-0.491]
10 if it is the continuous symmetric kernel of a positive integral operator. [sent-18, score-0.375]
11 This can be an elegant way to 'non-linearise' linear procedures which depend only on inner products of the examples. [sent-19, score-0.288]
12 Applications utilising kernel PCA are emerging [2], but in practice the approach suffers from one important disadvantage in that it is not a sparse method. [sent-20, score-0.553]
13 Computation of principal component projections for a given input $x$ requires evaluation of the kernel function $k(x, x_n)$ in respect of all N 'training' examples $x_n$. [sent-21, score-0.815]
14 This is an unfortunate limitation as in practice, to obtain the best model, we would like to estimate the kernel principal components from as much data as possible. [sent-22, score-0.705]
15 Here we tackle this problem by first approximating the covariance matrix in feature space by a subset of outer products of feature vectors, using a maximum-likelihood criterion based on a 'probabilistic PCA' model detailed in Section 3. [sent-23, score-0.913]
16 Importantly, the approximation we adopt is principled and controllable, and is related to the choice of the number of components to 'discard' in the conventional approach. [sent-25, score-0.052]
17 We demonstrate its efficacy in Section 4 and illustrate how it can offer similar performance to a full non-sparse kernel PCA implementation while offering much reduced computational overheads. [sent-26, score-0.507]
18 2 Kernel PCA Although PCA is conventionally defined (as above) in terms of the covariance, or outer-product, matrix, it is well-established that the eigenvectors of $X^\top X$ can be obtained from those of the inner-product matrix $X X^\top$. [sent-27, score-0.331]
19 If $V$ is an orthogonal matrix of column eigenvectors of $X X^\top$ with corresponding eigenvalues in the diagonal matrix $\Lambda$, then by definition $(X X^\top) V = V \Lambda$. [sent-28, score-0.602]
20 (1) From inspection, it can be seen that the eigenvectors of $X^\top X$ are $X^\top V$, with eigenvalues $\Lambda$. [sent-30, score-0.24]
21 Note, however, that the column vectors $X^\top V$ are not normalised, since for column $i$, $v_i^\top X X^\top v_i = \lambda_i v_i^\top v_i = \lambda_i$, so the correctly normalised eigenvectors of $X^\top X$, and thus the principal axes of the data, are given by $V_{\mathrm{pca}} = X^\top V \Lambda^{-1/2}$. [sent-31, score-0.733]
22 This derivation is useful if d > N, when the dimensionality of x is greater than the number of examples, but it is also fundamental for implementing kernel PCA. [sent-32, score-0.447]
23 In kernel PCA, the data vectors $x_n$ are implicitly mapped into a feature space by a set of functions $\{\phi\}: x_n \to \phi(x_n)$. [sent-33, score-0.795]
24 Although the vectors $\phi_n = \phi(x_n)$ in the feature space are generally not known explicitly, their inner products are defined by the kernel: $\phi_m^\top \phi_n = k(x_m, x_n)$. [sent-34, score-0.534]
25 Although we can't compute $V_{\mathrm{kpca}}$ since we don't know $\Phi$ explicitly, we can compute projections of arbitrary test vectors $x_* \to \phi_*$ onto $V_{\mathrm{kpca}}$ in feature space: $\phi_*^\top V_{\mathrm{kpca}} = \phi_*^\top \Phi^\top V \Lambda^{-1/2} = k_*^\top V \Lambda^{-1/2}$, (3) where $k_*$ is the N-vector of inner products of $x_*$ with the data in kernel space: $(k_*)_n = k(x_*, x_n)$. [sent-36, score-1.236]
26 We can thus compute, and plot, these projections - Figure 1 gives an example for some synthetic 3-cluster data in two dimensions. [sent-37, score-0.23]
27 Here, and in the rest of the paper, we do not 'centre' the data in feature space, although this may be achieved if desired (see [4]). [sent-38, score-0.235]
28 In fact, we would argue that when using a Gaussian kernel, it does not necessarily make sense to do so. [sent-39, score-0.07]
29 Figure 1: Contour plots of the first nine principal component projections evaluated over a region of input space for data from 3 Gaussian clusters (standard deviation 0. [sent-88, score-0.599]
30 Note how the first three components 'pick out' the individual clusters [4]. [sent-93, score-0.109]
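As a sketch of the computation in equations (1)-(3), assuming a Gaussian kernel and NumPy, with no centring of the data in feature space (per the footnote above); the function names and the kernel width are illustrative choices, not the paper's code:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix: entry (m, n) is exp(-gamma * ||a_m - b_n||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_pca_projections(X_train, X_test, n_components, gamma=1.0):
    """Project test points onto the leading kernel principal axes, with no
    centring in feature space, following equations (1)-(3)."""
    K = rbf_kernel(X_train, X_train, gamma)       # (K)_{mn} = k(x_m, x_n)
    lam, V = np.linalg.eigh(K)                    # K V = V Lambda
    top = np.argsort(lam)[::-1][:n_components]    # leading eigenvectors
    scaling = V[:, top] / np.sqrt(lam[top])       # V Lambda^{-1/2}
    K_star = rbf_kernel(X_test, X_train, gamma)   # rows are k_*^T
    return K_star @ scaling                       # k_*^T V Lambda^{-1/2}
```

Evaluating `kernel_pca_projections` over a grid of test points and contouring each output column reproduces plots of the kind shown in Figure 1.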
31 3 Probabilistic Feature-Space PCA Our approach to sparsifying kernel PCA is to a priori approximate the feature-space sample covariance matrix $S_\Phi$ with a sum of weighted outer products of a reduced number of feature vectors. [sent-94, score-1.293]
32 (The basis of this technique is thus general and its application not necessarily limited to kernel PCA.) [sent-95, score-0.491]
33 This is achieved probabilistically, by maximising the likelihood of the feature vectors under a Gaussian density model $\phi \sim N(0, C)$, where we specify the covariance $C$ by: $C = \sigma^2 I + \sum_{i=1}^{N} w_i \phi_i \phi_i^\top = \sigma^2 I + \Phi^\top W \Phi$, (4) where $w_1, \ldots,$ [sent-96, score-0.507]
34 $w_N$ are the adjustable weights, $W$ is a matrix with those weights on the diagonal, and $\sigma^2$ is an isotropic 'noise' component common to all dimensions of feature space. [sent-99, score-0.473]
35 Of course, a naive maximum of the likelihood under this model is obtained with $\sigma^2 = 0$ and all $w_i = 1/N$. [sent-100, score-0.065]
36 However, if we fix $\sigma^2$, and optimise only the weighting factors $w_i$, we will find that the maximum-likelihood estimates of many $w_i$ are zero, thus realising a sparse representation of the covariance matrix. [sent-101, score-0.392]
37 This probabilistic approach is motivated by the fact that if we relax the form of the model, by defining it in terms of outer products of N arbitrary vectors $v_i$ (rather than the fixed training vectors), i.e. [sent-102, score-0.474]
38 That is, if $\{u_i, \lambda_i\}$ are the set of eigenvectors/eigenvalues of $S_\Phi$, then the likelihood under this model is maximised by $v_i = u_i$ and $w_i = (\lambda_i - \sigma^2)^{1/2}$, for those $i$ for which $\lambda_i > \sigma^2$. [sent-105, score-0.103]
39 For $\lambda_i \le \sigma^2$, the most likely weights $w_i$ are zero. [sent-106, score-0.056]
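The thresholding behaviour stated in the last two sentences can be sketched for a finite-dimensional covariance (in the kernel setting $S_\Phi$ is never formed explicitly, so this fragment is only meant to illustrate which directions survive; the function name is illustrative):

```python
import numpy as np

def relaxed_ml_weights(S_phi, sigma2):
    """Maximum-likelihood solution of the relaxed model stated above: the
    directions are the eigenvectors u_i of the sample covariance, with weights
    w_i = (lambda_i - sigma^2)^{1/2} where lambda_i > sigma^2, and w_i = 0 otherwise."""
    lam, U = np.linalg.eigh(S_phi)
    w = np.sqrt(np.clip(lam - sigma2, 0.0, None))  # zero for lambda_i <= sigma^2
    return U, w
```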
40 3.1 Computations in feature space We wish to maximise the likelihood under a Gaussian model with covariance given by (4). [sent-108, score-0.531]
41 Ignoring terms independent of the weighting parameters, its log is given by: $\mathcal{L} = -\tfrac{1}{2}\big[ N \log|C| + \sum_{n=1}^{N} \phi_n^\top C^{-1} \phi_n \big]$. (5) Computing (5) requires the quantities $|C|$ and $\phi^\top C^{-1} \phi$, which for infinite-dimensional feature spaces might appear problematic. [sent-109, score-0.471]
42 However, by judicious re-writing of the terms of interest, we are able to both compute the log-likelihood (to within a constant) and optimise it with respect to the weights. [sent-110, score-0.211]
43 First, we can write: $\log|\sigma^2 I + \Phi^\top W \Phi| = D \log \sigma^2 + \log|W^{-1} + \sigma^{-2} \Phi \Phi^\top| + \log|W|$. [sent-111, score-0.224]
44 (6) The potential problem of infinite dimensionality, $D$, of the feature space now enters only in the first term, which is constant if $\sigma^2$ is fixed and so does not affect maximisation. [sent-112, score-0.309]
45 The term in $|W|$ is straightforward and the remaining term can be expressed in terms of the inner-product (kernel) matrix: $W^{-1} + \sigma^{-2} \Phi \Phi^\top = W^{-1} + \sigma^{-2} K$, (7) where $K$ is the kernel matrix such that $(K)_{mn} = k(x_m, x_n)$. [sent-113, score-0.665]
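A quick numerical spot-check (not from the paper) of identities (6) and (7), using a random finite-dimensional stand-in for $\Phi$; all sizes and values here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 5, 7, 0.3
Phi = rng.normal(size=(N, D))       # rows are the feature vectors phi_n^T
w = rng.uniform(0.5, 2.0, size=N)
W = np.diag(w)

lhs = np.linalg.slogdet(sigma2 * np.eye(D) + Phi.T @ W @ Phi)[1]
rhs = (D * np.log(sigma2)
       + np.linalg.slogdet(np.linalg.inv(W) + Phi @ Phi.T / sigma2)[1]
       + np.log(w).sum())
assert np.isclose(lhs, rhs)         # identity (6), with Phi Phi^T playing the role of K
```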
46 For the data-dependent term in the likelihood, we can use the Woodbury matrix inversion identity to compute the quantities $\phi_n^\top C^{-1} \phi_n$: $\phi_n^\top (\sigma^2 I + \Phi^\top W \Phi)^{-1} \phi_n = \sigma^{-2} k(x_n, x_n) - \sigma^{-4} k_n^\top (W^{-1} + \sigma^{-2} K)^{-1} k_n$, [sent-114, score-0.315]
47 with $k_n = (k(x_n, x_1), \ldots, k(x_n, x_N))^\top$. (8) 3.2 Optimising the weights To maximise the log-likelihood with respect to the weights $w_i$, [sent-118, score-0.179]
48 differentiating (5) gives us the gradient $\partial \mathcal{L} / \partial w_i$. [sent-122, score-0.08]
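Putting (5)-(8) together, the log-likelihood can be evaluated purely from the kernel matrix. The sketch below assumes the standard zero-mean Gaussian form of (5) and uses NumPy; the function name and the handling of zero weights are illustrative rather than the paper's implementation:

```python
import numpy as np

def sparse_kpca_log_likelihood(K, w, sigma2):
    """Log-likelihood (up to an additive constant) of the feature vectors under
    C = sigma^2 I + Phi^T W Phi, evaluated purely from the kernel matrix K using
    identities (6)-(8).  Examples whose weight is zero drop out entirely."""
    active = w > 0
    Ka = K[:, active]                                     # k(x_n, x_j) for retained x_j
    B = np.diag(1.0 / w[active]) + K[np.ix_(active, active)] / sigma2
    B_inv = np.linalg.inv(B)
    # log|C|, minus its constant D*log(sigma^2) term, via (6) and (7):
    logdet = np.linalg.slogdet(B)[1] + np.log(w[active]).sum()
    # phi_n^T C^{-1} phi_n for every n, via the Woodbury identity (8):
    quad = np.diag(K) / sigma2 - np.einsum('nm,mk,nk->n', Ka, B_inv, Ka) / sigma2 ** 2
    return -0.5 * (K.shape[0] * logdet + quad.sum())
```

With $\sigma^2$ fixed, maximising this quantity over the $w_i$ drives many of them to zero, so that only the kernel columns of the retained examples are needed, which is the source of the sparsity.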
wordName wordTfidf (topN-words)
[('kernel', 0.375), ('pca', 0.362), ('vkpca', 0.226), ('xtv', 0.226), ('wi', 0.209), ('xn', 0.208), ('eigenvectors', 0.165), ('principal', 0.164), ('feature', 0.158), ('projections', 0.155), ('xtx', 0.153), ('covariance', 0.153), ('matrix', 0.129), ('pea', 0.126), ('products', 0.113), ('rp', 0.109), ('inner', 0.103), ('sparse', 0.098), ('vectors', 0.093), ('outer', 0.091), ('maximise', 0.088), ('optimise', 0.088), ('component', 0.086), ('sq', 0.082), ('ai', 0.078), ('eigenvalues', 0.075), ('mn', 0.072), ('elegant', 0.072), ('dimensionality', 0.072), ('necessarily', 0.07), ('axes', 0.069), ('space', 0.067), ('defines', 0.066), ('implicitly', 0.066), ('cp', 0.066), ('likelihood', 0.065), ('normalised', 0.063), ('column', 0.058), ('clusters', 0.057), ('log', 0.056), ('weights', 0.056), ('weighting', 0.053), ('components', 0.052), ('quantities', 0.052), ('xm', 0.052), ('defining', 0.052), ('compute', 0.051), ('ui', 0.05), ('motivated', 0.05), ('reduced', 0.049), ('summarised', 0.049), ('iwi', 0.049), ('fe', 0.049), ('george', 0.049), ('offering', 0.049), ('nonlinear', 0.047), ('diagonal', 0.046), ('technique', 0.046), ('approximating', 0.044), ('emerging', 0.044), ('isotropic', 0.044), ('inspection', 0.044), ('innovation', 0.044), ('xx', 0.044), ('unfortunate', 0.044), ('wherein', 0.044), ('xxt', 0.044), ('probabilistically', 0.044), ('infinite', 0.043), ('term', 0.042), ('st', 0.042), ('although', 0.041), ('controllable', 0.041), ('inversion', 0.041), ('retain', 0.041), ('enters', 0.041), ('optimising', 0.041), ('differentiating', 0.041), ('guildhall', 0.041), ('expressed', 0.04), ('gives', 0.039), ('relax', 0.038), ('minimise', 0.038), ('maximised', 0.038), ('maximising', 0.038), ('scope', 0.038), ('exploratory', 0.038), ('tipping', 0.038), ('developments', 0.038), ('gaussian', 0.038), ('explicitly', 0.038), ('terms', 0.037), ('suffers', 0.036), ('data', 0.036), ('respect', 0.035), ('onto', 0.035), ('vi', 0.035), ('offer', 0.034), ('nine', 0.034), ('limitation', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
2 0.26042011 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
Author: Christopher K. I. Williams
Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1
3 0.23350225 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
4 0.22901618 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
5 0.2004189 27 nips-2000-Automatic Choice of Dimensionality for PCA
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
6 0.15269658 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
7 0.13701099 120 nips-2000-Sparse Greedy Gaussian Process Regression
8 0.1348311 133 nips-2000-The Kernel Gibbs Sampler
9 0.12241925 110 nips-2000-Regularization with Dot-Product Kernels
10 0.12185201 74 nips-2000-Kernel Expansions with Unlabeled Examples
11 0.11803808 4 nips-2000-A Linear Programming Approach to Novelty Detection
12 0.11368646 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
13 0.11239223 49 nips-2000-Explaining Away in Weight Space
14 0.10935637 75 nips-2000-Large Scale Bayes Point Machines
15 0.10645082 122 nips-2000-Sparse Representation for Gaussian Process Models
16 0.10617781 51 nips-2000-Factored Semi-Tied Covariance Matrices
17 0.10082304 58 nips-2000-From Margin to Sparsity
18 0.097685724 54 nips-2000-Feature Selection for SVMs
19 0.093663119 35 nips-2000-Computing with Finite and Infinite Networks
20 0.090118736 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
topicId topicWeight
[(0, 0.29), (1, 0.161), (2, -0.028), (3, 0.182), (4, -0.032), (5, 0.429), (6, -0.152), (7, 0.006), (8, -0.042), (9, -0.092), (10, -0.056), (11, -0.139), (12, -0.123), (13, 0.039), (14, -0.1), (15, 0.017), (16, 0.103), (17, 0.079), (18, -0.015), (19, 0.086), (20, 0.098), (21, 0.104), (22, 0.001), (23, -0.076), (24, -0.047), (25, 0.124), (26, 0.08), (27, -0.002), (28, -0.026), (29, 0.028), (30, -0.021), (31, -0.002), (32, -0.002), (33, -0.044), (34, 0.035), (35, 0.01), (36, 0.021), (37, -0.063), (38, -0.051), (39, -0.006), (40, -0.122), (41, 0.036), (42, 0.006), (43, -0.041), (44, 0.05), (45, 0.015), (46, -0.016), (47, 0.03), (48, 0.025), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.98378062 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
2 0.86535901 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
Author: Christopher K. I. Williams
Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1
3 0.69466507 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.63434291 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
5 0.60552984 27 nips-2000-Automatic Choice of Dimensionality for PCA
Author: Thomas P. Minka
Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
6 0.56019145 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
7 0.49292311 110 nips-2000-Regularization with Dot-Product Kernels
8 0.47475839 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
9 0.45233402 120 nips-2000-Sparse Greedy Gaussian Process Regression
10 0.42748475 133 nips-2000-The Kernel Gibbs Sampler
11 0.42134964 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
12 0.41025424 51 nips-2000-Factored Semi-Tied Covariance Matrices
13 0.38536102 54 nips-2000-Feature Selection for SVMs
14 0.37689933 74 nips-2000-Kernel Expansions with Unlabeled Examples
15 0.37559766 49 nips-2000-Explaining Away in Weight Space
16 0.35840583 4 nips-2000-A Linear Programming Approach to Novelty Detection
17 0.3152009 75 nips-2000-Large Scale Bayes Point Machines
18 0.30308104 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
19 0.29619822 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
20 0.29198775 12 nips-2000-A Support Vector Method for Clustering
topicId topicWeight
[(10, 0.018), (17, 0.708), (33, 0.029), (55, 0.011), (62, 0.016), (67, 0.03), (75, 0.015), (76, 0.043), (90, 0.027), (97, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99887091 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
2 0.99553621 54 nips-2000-Feature Selection for SVMs
Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik
Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.
3 0.99482369 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
Author: James M. Coughlan, Alan L. Yuille
Abstract: Preliminary work by the authors made use of the so-called
4 0.99061632 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing
Author: Robert A. Legenstein, Wolfgang Maass
Abstract: We introduce total wire length as salient complexity measure for an analysis of the circuit complexity of sensory processing in biological neural systems and neuromorphic engineering. This new complexity measure is applied to a set of basic computational problems that apparently need to be solved by circuits for translation- and scale-invariant sensory processing. We exhibit new circuit design strategies for these new benchmark functions that can be implemented within realistic complexity bounds, in particular with linear or almost linear total wire length.
5 0.98976326 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
Author: Te-Won Lee, Thomas Wachtler, Terrence J. Sejnowski
Abstract: The human visual system encodes the chromatic signals conveyed by the three types of retinal cone photoreceptors in an opponent fashion. This color opponency has been shown to constitute an efficient encoding by spectral decorrelation of the receptor signals. We analyze the spatial and chromatic structure of natural scenes by decomposing the spectral images into a set of linear basis functions such that they constitute a representation with minimal redundancy. Independent component analysis finds the basis functions that transform the spatiochromatic data such that the outputs (activations) are statistically as independent as possible, i.e. least redundant. The resulting basis functions show strong opponency along an achromatic direction (luminance edges), along a blue-yellow direction, and along a red-blue direction. Furthermore, the resulting activations have very sparse distributions, suggesting that the use of color opponency in the human visual system achieves a highly efficient representation of colors. Our findings suggest that color opponency is a result of the properties of natural spectra and not solely a consequence of the overlapping cone spectral sensitivities. 1 Statistical structure of natural scenes Efficient encoding of visual sensory information is an important task for information processing systems and its study may provide insights into coding principles of biological visual systems. An important goal of sensory information processing Electronic version available at www.cnl.salk.edu/
6 0.91572154 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
7 0.8468681 130 nips-2000-Text Classification using String Kernels
8 0.84185165 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
9 0.82735485 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
10 0.82034069 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
11 0.81556052 4 nips-2000-A Linear Programming Approach to Novelty Detection
12 0.81111526 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
13 0.80464888 133 nips-2000-The Kernel Gibbs Sampler
14 0.80432343 51 nips-2000-Factored Semi-Tied Covariance Matrices
15 0.79852432 82 nips-2000-Learning and Tracking Cyclic Human Motion
16 0.79450029 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
17 0.79015064 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition
18 0.78776079 118 nips-2000-Smart Vision Chip Fabricated Using Three Dimensional Integration Technology
19 0.78758365 36 nips-2000-Constrained Independent Component Analysis
20 0.7855401 79 nips-2000-Learning Segmentation by Random Walks