nips nips2002 nips2002-138 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
Reference: text
sentIndex sentText sentNum sentScore
1 Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. [sent-10, score-0.319]
2 We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. [sent-11, score-0.929]
3 Experiments in density estimation show significant improvements with respect to Parzen density estimators. [sent-12, score-0.399]
4 The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier. [sent-13, score-0.211]
5 1 Introduction In [1], while attempting to better understand and bridge the gap between the good performance of the popular Support Vector Machines and the more traditional K-NN (K Nearest Neighbors) for classification problems, we had suggested a modified Nearest-Neighbor algorithm. [sent-14, score-0.041]
6 This algorithm, which was able to slightly outperform SVMs on several realworld problems, was based on the geometric intuition that the classes actually lived “close to” a lower dimensional non-linear manifold in the high dimensional input space. [sent-15, score-0.379]
7 When this was not properly taken into account, as with traditional K-NN, the sparsity of the data points due to having a finite number of training samples would cause “holes” or “zig-zag” artifacts in the resulting decision surface, as illustrated in Figure 1. [sent-16, score-0.172]
8 Figure 1: A local view of the decision surface, with “holes”, produced by the Nearest Neighbor when the data have a local structure (horizontal direction). [sent-17, score-0.15]
9 The present work is based on the same underlying geometric intuition, but applied to the well known Parzen windows [2] non-parametric method for density estimation, using Gaussian kernels. [sent-18, score-0.452]
10 with a diagonal covariance matrix, or even a “full Gaussian” with a full covariance matrix, usually set to be proportional to the global empirical covariance of the ¢ £¡ training data. [sent-22, score-0.418]
11 However these are equivalent to using a spherical Gaussian on preprocessed, normalized data (i. [sent-23, score-0.113]
12 normalized by subtracting the empirical sample mean, and multiplying by the inverse sample covariance). [sent-25, score-0.024]
13 Whatever the shape of the kernel, if, as is customary, a fixed shape is used, merely centered on every training point, the shape can only compensate for the global structure (such as global covariance) of the data. [sent-26, score-0.137]
14 This is likely to result in an excessive “bumpyness” of the thus modeled density, much like the “holes” and “zig-zag” artifacts observed in KNN (see Fig. [sent-29, score-0.071]
15 The resulting model is a mixture of Gaussian “pancakes”, similar to [3], mixtures of probabilistic PCAs [4] or mixtures of factor analyzers [5, 6], in the same way that the most traditional Parzen Windows is a mixture of spherical Gaussians. [sent-33, score-0.35]
16 But it remains a memory-based method, with a Gaussian kernel centered on each training points, yet with a differently shaped kernel for each point. [sent-34, score-0.201]
17 Let be an -dimensional random variable with values in , and an unknown probability density function . [sent-36, score-0.18]
18 Our training set contains samples of that random variable, collected in a matrix whose row is the -th sample. [sent-37, score-0.098]
19 ) c c c b` ` X U S d0a) WYWVT 1 R '¤F g g Vf CE I 8 q ph h iI 9 e rE 0iI 9 e 2 ¤F CE P8 I : A CA 9 £E DB@8 ('£$ & % ¦ " £ ) where matrix (1) and covariance (2) where is the determinant of . [sent-40, score-0.147]
20 From the above discussion, we expect that if there is an underlying “non-linear principal manifold”, those gaussians would be “pancakes” aligned with the plane locally tangent to this underlying manifold. [sent-42, score-0.353]
21 The only available information (in the absence of further prior knowledge) about this tangent plane can be gathered from the training samples int the neighborhood of . [sent-43, score-0.259]
22 In other words, we are interested in computing the principal directions of the samples in the neighborhood of . [sent-44, score-0.262]
23 ¡ ) ¡ ` a `) ¡ £ For generality, we can define a soft neighborhood of with a neighborhood kernel that will associate an influence weight to any point in the neighborhood of ¡ ¡ F t s . [sent-45, score-0.347]
24 could be a spherical Gaussian centered on for instance, or any other positive definite kernel, possibly incorporating prior knowledge as to what constitutes a reasonable neighborhood for point . [sent-47, score-0.243]
25 Notice that if is a constant (uniform kernel), is the global training sample covariance. [sent-48, score-0.08]
26 In that case, is the unweighted covariance of the nearest neighbors of . [sent-50, score-0.282]
27 Now that we have a way of computing a local covariance matrix for each training point, we might be tempted to use this directly in equations 2 and 1. [sent-52, score-0.278]
28 But a number of problems must first be addressed: A Equation 2 requires the inverse covariance matrix, whereas is likely to be illconditioned. [sent-53, score-0.129]
29 This situation will definitely arise if we use a hard k-neighborhood with . [sent-54, score-0.028]
30 In this case we get a Gaussian that is totally flat outside of the affine subspace spanned by and its neighbors, and it does not constitute a proper density in . [sent-55, score-0.18]
31 A common way to deal with this problem is to add a small isotropic (spherical) Gaussian noise of variance in all directions, which is done by simply adding to the diagonal of . [sent-56, score-0.105]
32 the covariance matrix: ) © ¨§ ¢ ¡ ¢ ¡ $ " %¢ ¡ #A ¡) 1 ¡ ) ¦ ! [sent-57, score-0.105]
33 ¡ ¢ ¡ Even if we regularize by adding , when we deal with high dimensional spaces, it would be prohibitive in computation time and storage to keep and use the full inverse covariance matrix as expressed in 2. [sent-58, score-0.259]
34 This would in effect multiply both the time and storage requirement of the already expensive ordinary Parzen Windows by . [sent-59, score-0.209]
35 So instead, we use a different, more compact representation of the inverse Gaussian, by storing only the eigenvectors associated with the first few largest eigenvalues of , as described below. [sent-60, score-0.144]
36 2 " ¡ ) ¦ ¡ ) can be expressed as: , The eigen-decomposition of a covariance matrix where the columns of are the orthonormal eigenvectors and is a diagonal matrix with the eigenvalues , that we will suppose sorted in decreasing order, without loss of generality. [sent-61, score-0.332]
37 & ( & 10)'1 ) ) ( X 2 3 7 2 & The first eigenvectors with largest eigenvalues correspond to the principal directions of the local neighborhood, i. [sent-62, score-0.36]
38 the high variance local directions of the supposed underlying -dimensional manifold (but the true underlying dimension is unknown and may actually vary across space). [sent-64, score-0.534]
39 The last few eigenvalues and eigenvectors are but noise directions with a small variance. [sent-65, score-0.244]
40 So we may, without too much risk, force those last few components to the same low noise level . [sent-66, score-0.029]
41 We have done this by zeroing the last eigenvalues (by considering only the first leading eigenvalues) and then adding to all eigenvalues. [sent-67, score-0.092]
42 Thus both the storage requirement and the computational cost when estimating the density at a test point is only about times that of ordinary Parzen. [sent-69, score-0.398]
43 Note that with , we trivially obtain the traditional Parzen windows estimator. [sent-73, score-0.266]
44 " Algorithm MParzen::Train( ) © ¨§ ¡ ¡ Input: training set matrix with rows , chosen number of principal directions , chosen number of neighbors , and regularization hyper-parameter . [sent-75, score-0.359]
45 (1) For (2) Collect nearest neighbors of , and put in the rows of matrix . [sent-76, score-0.219]
46 (3) Perform a partial singular value decomposition of , to obtain the leading singular values ( ) and singular column vectors of . [sent-77, score-0.104]
47 ¢ ¡ 2 2 £ ¡ 2 7 8¡ & 4 ¦ & AG B " 4 2 2 & " A ¢ ¡ G 4 G¤¦G¢¦G ¦G F 063G G 2 )¡ 4 4 3 ¢ ¡ 1 £¡ 2 ( f 9@" 0 4 63G G 4 3 2 (5¡ £ ¡ 0 1 G G S G 2 )¡ # ( 4 , where is an tensor that matrix with all the eigenvalues. [sent-78, score-0.042]
48 £ Algorithm MParzen::Test( ¢ ¡ G 4 G 3W3'3G 0 1 G 2G & " Input: test point and model (1) (2) For (3) LocalGaussian( Output: manifold Parzen estimator 4 is a £ 4 ' & (4) For , let Output: The model = collects all the eigenvectors and ) . [sent-79, score-0.407]
49 A 3 C FE3 0 1 G G S G " 2 )¡ # ( D3 C 3 Related work As we have already pointed out, Manifold Parzen Windows, like traditional Parzen Windows and so many other density estimation algorithms, results in defining the density as a mixture of Gaussians. [sent-81, score-0.482]
50 What differs is mostly how those Gaussians and their parameters are chosen. [sent-82, score-0.031]
51 The idea of having a parameterization of each Gaussian that orients it along the local principal directions also underlies the already mentioned work on mixtures of Gaussian pancakes [3], mixtures of probabilistic PCAs [4], and mixtures of factor analysers [5, 6]. [sent-83, score-0.58]
52 All these algorithms typically model the density using a relatively small number of Gaussians, whose centers and parameters must be learnt with some iterative optimisation algorithm such as EM (procedures which are known to be sensitive to local minima traps). [sent-84, score-0.255]
53 It avoids the problem of optimizing the centers by assigning a Gaussian to every training point, and uses simple analytic SVD to compute the local principal directions for each. [sent-86, score-0.322]
54 Another successful memory-based approach that uses local directions and inspired our work is the tangent distance algorithm [7]. [sent-87, score-0.28]
55 While this approach was initially aimed at solving classification tasks with a nearest neighbor paradigm, some work has already been done in developing it into a probabilistic interpretation for mixtures with a few gaussians, as well as for full-fledged kernel density estimation [8, 9]. [sent-88, score-0.56]
56 The main difference between our approach and the above is that the Manifold Parzen estimator does not require prior knowledge, as it infers the local directions directly from the data, although it should be easy to also incorporate prior knowledge if available. [sent-89, score-0.234]
57 We should also mention similarities between our approach and the Local Linear Embedding and recent related dimensionality reduction methods [10, 11, 12, 13]. [sent-90, score-0.039]
58 Lastly, it can also be seen as an extension along the line of traditional variable and adaptive kernel estimators that adapt the kernel width locally (see [18] for a survey). [sent-92, score-0.278]
59 4 Experimental results Throughout this whole section, when we mention Parzen Windows (sometimes abbreviated Parzen ), we mean ordinary Parzen windows using a spherical Gaussian kernel with a single hyper-parameter , the width of the Gaussian. [sent-93, score-0.64]
60 ¡ When we mention Manifold Parzen Windows (sometimes abbreviated MParzen), we used a hard k-neighborhood, so that the hyper-parameters are: the number of neighbors , the number of retained principal components , and the additional isotropic Gaussian noise parameter . [sent-94, score-0.301]
61 4 , we used the average negative log examples from a test set. [sent-95, score-0.031]
62 £0$ ¡ ¡ When measuring the quality of a density estimator likelihood: ANLL with the % 1 % ¢ 7 ¡ F a£ 7 5 ¡ $ § ¥ 4. [sent-96, score-0.244]
63 1 Experiment on 2D artificial data A training set of 300 points, a validation set of 300 points and a test set of 10000 points points: were generated from the following distribution of two dimensional " 4£ 3 G 2 ¥ § ¥ is uniform in the interval 2 $ 9 G 1 ) 2 ! [sent-97, score-0.276]
64 ¢F ¡ G where and ) $ % ¥ 3G 2 H#£ We trained an ordinary Parzen, as well as MParzen with and on the training set, tuning the hyper-parameters to achieve best performance on the validation set. [sent-100, score-0.271]
65 Figure 2 shows the training set and gives a good idea of the densities produced by both kinds of algorithms (as the visual representation for MParzen with and did not appear very different, we show only the case ). [sent-101, score-0.056]
66 The graphic reveals the anticipated “bumpyness” artifacts of ordinary Parzen, and shows that MParzen is indeed able to better concentrate the probability density along the manifold, even when the training data is scarce. [sent-102, score-0.444]
67 S @4 1 1 @4 S 1 4 2 1 4 2 1 4 Quantitative comparative results of the two models are reported in table 1 Table 1: Comparative results on the artificial data (standard errors are in parenthesis). [sent-103, score-0.081]
68 009) Several points are worth noticing: Both MParzen models seem to achieve a lower ANLL than ordinary Parzen (even though the underlying manifold really has dimension ), and with more consistency over the test sets (lower standard error). [sent-111, score-0.496]
69 The optimal width for ordinary Parzen is much larger than the noise parameter of the true generating model (0. [sent-112, score-0.231]
70 2 1 4 ¡ 2 The optimal regularization parameter for MParzen with (i. [sent-114, score-0.025]
71 supposing a one-dimensional underlying manifold) is very close to the actual noise parameter of the true generating model. [sent-116, score-0.143]
72 This suggests that it was able to capture the underlying structure quite well. [sent-117, score-0.047]
73 Also it is the best of the three models, which is not surprising, since the true model is indeed a one dimensional manifold with an added isotropic Gaussian noise. [sent-118, score-0.34]
74 The optimal additional noise parameter for MParzen with (i. [sent-119, score-0.029]
75 supposing a two-dimensional underlying manifold) is close to 0, which suggests that the model was able to capture all the noise in the second “principal direction”. [sent-121, score-0.117]
76 ¡ 1 4 S 1 4 ¡ Figure 2: Illustration of the density estimated by ordinary Parzen Windows (left) and Manifold Parzen Windows (right). [sent-122, score-0.317]
77 The 300 training points are represented as black dots and the area where the estimated density is above 1. [sent-124, score-0.273]
78 The excessive “bumpyness” and holes produced by ordinary Parzen windows model can clearly be seen, whereas Manifold Parzen density is better aligned with the underlying manifold, allowing it to even successfully “extrapolate” in regions with few data points but high true density. [sent-126, score-0.777]
79 2 Density estimation on OCR data In order to compare the performance of both algorithms for density estimation on a realworld problem, we estimated the density of one class of the MNIST OCR data set, namely the “2” digit. [sent-128, score-0.479]
80 The available data for this class was divided into 5400 training points, 558 validation points and 1032 test points. [sent-129, score-0.202]
81 Note that the improvement with respect to Parzen windows is extremely large and of course statistically significant. [sent-132, score-0.225]
82 Table 2: Density estimation of class ’2’ in the MNIST data set. [sent-133, score-0.039]
83 Algorithm Parameters used validation ANLL test ANLL Parzen -197. [sent-135, score-0.109]
84 3 Classification performance To obtain a probabilistic classifier with a density estimator we train an estimator for each class , and apply Bayes’ rule to obtain . [sent-144, score-0.33]
85 ¡ ` ¡ $ £ 7 5 ¡ § ¥ % 7% ¢ £ ` $ 1 This method was applied to both the Parzen and the Manifold Parzen density estimators, which were compared with state-of-the-art Gaussian SVMs on the full USPS data set. [sent-146, score-0.18]
86 The original training set (7291) was split into a training (first 6291) and validation set (last 1000), used to tune hyper-parameters. [sent-147, score-0.19]
87 The classification errors for all three methods are compared in Table 3, where the hyper-parameters are chosen based on validation classification error. [sent-148, score-0.078]
88 The log-likelihoods are compared in Table 4, where the hyper-parameters are chosen based on validation ANCLL. [sent-149, score-0.078]
89 Hyper-parameters for SVMs are the box constraint and the Gaussian width . [sent-150, score-0.039]
90 ) ¡ Table 3: Classification error obtained on USPS with SVM, Parzen windows and Manifold Parzen windows classifiers. [sent-152, score-0.45]
91 1 ¢ ¡ 2 2 1 2 2 1 4 2 1 ¡ ¡ 2 1 ) validation error 1. [sent-157, score-0.078]
92 They have allowed to show the usefulness of learning the local structure of the data through a regularized covariance matrix estimated for each data point. [sent-167, score-0.242]
93 By taking advantage of local structure, the new kernel density estimation method outperforms the Parzen windows estimator. [sent-168, score-0.575]
94 Classifiers built from this density estimator yield state-of-the-art knowledge-free performance, which is remarkable for a not discriminatively trained classifier. [sent-169, score-0.244]
95 Besides, in some applications, the accurate estimation of probabilities can be crucial, e. [sent-170, score-0.039]
96 Future work should consider other alternative methods of estimating the local covariance matrix, for example as suggested here using a weighted estimator, or taking advantage of prior knowledge (e. [sent-173, score-0.18]
97 Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. [sent-184, score-0.023]
98 On the estimation of a probability density function and mode. [sent-188, score-0.219]
99 Transformation invariance in pattern recognition — tangent distance and tangent propagation. [sent-235, score-0.195]
100 The multi-class measure problem in nearest neighbour discrimination rules. [sent-292, score-0.106]
wordName wordTfidf (topN-words)
[('parzen', 0.684), ('mparzen', 0.31), ('manifold', 0.244), ('windows', 0.225), ('density', 0.18), ('ordinary', 0.137), ('spherical', 0.113), ('nearest', 0.106), ('covariance', 0.105), ('anll', 0.103), ('neighborhood', 0.097), ('directions', 0.095), ('tangent', 0.085), ('ancll', 0.083), ('validation', 0.078), ('local', 0.075), ('holes', 0.072), ('neighbors', 0.071), ('classi', 0.071), ('principal', 0.07), ('eigenvectors', 0.068), ('neighbor', 0.068), ('mixtures', 0.067), ('estimator', 0.064), ('bumpyness', 0.062), ('pancakes', 0.062), ('gaussian', 0.062), ('kernel', 0.056), ('training', 0.056), ('eigenvalues', 0.052), ('comparative', 0.049), ('underlying', 0.047), ('matrix', 0.042), ('dahmen', 0.041), ('keysers', 0.041), ('kiel', 0.041), ('localgaussian', 0.041), ('pcas', 0.041), ('realworld', 0.041), ('supposing', 0.041), ('vincent', 0.041), ('gaussians', 0.041), ('traditional', 0.041), ('mention', 0.039), ('estimation', 0.039), ('width', 0.039), ('obermayer', 0.038), ('artifacts', 0.038), ('dimensional', 0.037), ('points', 0.037), ('ce', 0.037), ('ocr', 0.036), ('concentrated', 0.035), ('thrun', 0.034), ('centered', 0.033), ('isotropic', 0.033), ('excessive', 0.033), ('along', 0.033), ('table', 0.032), ('becker', 0.032), ('mostly', 0.031), ('estimators', 0.031), ('editors', 0.031), ('abbreviated', 0.031), ('test', 0.031), ('storage', 0.031), ('noise', 0.029), ('svms', 0.029), ('vicinity', 0.029), ('usps', 0.029), ('hard', 0.028), ('association', 0.028), ('singular', 0.028), ('mnist', 0.027), ('german', 0.026), ('true', 0.026), ('assigning', 0.026), ('svd', 0.025), ('touretzky', 0.025), ('cation', 0.025), ('distance', 0.025), ('regularization', 0.025), ('global', 0.024), ('inverse', 0.024), ('diagonal', 0.023), ('volume', 0.023), ('locally', 0.022), ('already', 0.022), ('probabilistic', 0.022), ('plane', 0.021), ('advances', 0.021), ('probably', 0.021), ('adding', 0.02), ('leading', 0.02), ('aligned', 0.02), ('regularized', 0.02), ('input', 0.02), ('mixture', 0.02), ('requirement', 0.019), ('covariances', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 138 nips-2002-Manifold Parzen Windows
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
2 0.18288222 49 nips-2002-Charting a Manifold
Author: Matthew Brand
Abstract: We construct a nonlinear mapping from a high-dimensional sample space to a low-dimensional vector space, effectively recovering a Cartesian coordinate system for the manifold from which the data is sampled. The mapping preserves local geometric relations in the manifold and is pseudo-invertible. We show how to estimate the intrinsic dimensionality of the manifold from samples, decompose the sample data into locally linear low-dimensional patches, merge these patches into a single lowdimensional coordinate system, and compute forward and reverse mappings between the sample and coordinate spaces. The objective functions are convex and their solutions are given in closed form. 1 Nonlinear dimensionality reduction (NLDR) by charting Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space. It is presumed that the data lies on or near a lowdimensional manifold embedded in the sample space, and that there exists a 1-to-1 smooth nonlinear transform between the manifold and a low-dimensional vector space. The datamodeler’s goal is to estimate smooth continuous mappings between the sample and coordinate spaces. Often this analysis will shed light on the intrinsic variables of the datagenerating phenomenon, for example, revealing perceptual or configuration spaces. Our goal is to find a mapping—expressed as a kernel-based mixture of linear projections— that minimizes information loss about the density and relative locations of sample points. This constraint is expressed in a posterior that combines a standard gaussian mixture model (GMM) likelihood function with a prior that penalizes uncertainty due to inconsistent projections in the mixture. Section 3 develops a special case where this posterior is unimodal and maximizable in closed form, yielding a GMM whose covariances reveal a patchwork of overlapping locally linear subspaces that cover the manifold. Section 4 shows that for this (or any) GMM and a choice of reduced dimension d, there is a unique, closed-form solution for a minimally distorting merger of the subspaces into a d-dimensional coordinate space, as well as an reverse mapping defining the surface of the manifold in the sample space. The intrinsic dimensionality d of the data manifold can be estimated from the growth process of point-to-point distances. In analogy to differential geometry, we call the subspaces “charts” and their merger the “connection.” Section 5 considers example problems where these methods are used to untie knots, unroll and untwist sheets, and visualize video data. 1.1 Background Topology-neutral NLDR algorithms can be divided into those that compute mappings, and those that directly compute low-dimensional embeddings. The fi has its roots in mapeld ping algorithms: DeMers and Cottrell [3] proposed using auto-encoding neural networks with a hidden layer “ bottleneck,” effectively casting dimensionality reduction as a compression problem. Hastie defi principal curves [5] as nonparametric 1 D curves that pass ned through the center of “ nearby” data points. A rich literature has grown up around properly regularizing this approach and extending it to surfaces. Smola and colleagues [10] analyzed the NLDR problem in the broader framework of regularized quantization methods. More recent advances aim for embeddings: Gomes and Mojsilovic [4] treat manifold completion as an anisotropic diffusion problem, iteratively expanding points until they connect to their neighbors. The I SO M AP algorithm [12] represents remote distances as sums of a trusted set of distances between immediate neighbors, then uses multidimensional scaling to compute a low-dimensional embedding that minimally distorts all distances. The locally linear embedding algorithm (LLE) [9] represents each point as a weighted combination of a trusted set of nearest neighbors, then computes a minimally distorting low-dimensional barycentric embedding. They have complementary strengths: I SO M AP handles holes well but can fail if the data hull is nonconvex [12]; and vice versa for LLE [9]. Both offer embeddings without mappings. It has been noted that trusted-set methods are vulnerable to noise because they consider the subset of point-to-point relationships that has the lowest signal-to-noise ratio; small changes to the trusted set can induce large changes in the set of constraints on the embedding, making solutions unstable [1]. In a return to mapping, Roweis and colleagues [8] proposed global coordination— learning a mixture of locally linear projections from sample to coordinate space. They constructed a posterior that penalizes distortions in the mapping, and gave a expectation-maximization (EM) training rule. Innovative use of variational methods highlighted the diffi culty of even hill-climbing their multimodal posterior. Like [2, 7, 6, 8], the method we develop below is a decomposition of the manifold into locally linear neighborhoods. It bears closest relation to global coordination [8], although by a different construction of the problem, we avoid hill-climbing a spiky posterior and instead develop a closed-form solution. 2 Estimating locally linear scale and intrinsic dimensionality . We begin with matrix of sample points Y = [y1 , · · · , yN ], yn ∈ RD populating a Ddimensional sample space, and a conjecture that these points are samples from a manifold M of intrinsic dimensionality d < D. We seek a mapping onto a vector space . G(Y) → X = [x1 , · · · , xN ], xn ∈ Rd and 1-to-1 reverse mapping G−1 (X) → Y such that local relations between nearby points are preserved (this will be formalized below). The map G should be non-catastrophic, that is, without folds: Parallel lines on the manifold in RD should map to continuous smooth non-intersecting curves in Rd . This guarantees that linear operations on X such as interpolation will have reasonable analogues on Y. Smoothness means that at some scale r the mapping from a neighborhood on M to Rd is effectively linear. Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as rd , but only at the locally linear scale; the grow rate is inflated by isotropic noise at smaller scales and by embedding curvature at larger scales. . To estimate r, we look at how the r-ball grows as points are added to it, tracking c(r) = d d log n(r) log r. At noise scales, c(r) ≈ 1/D < 1/d, because noise has distributed points in all directions with equal probability. At the scale at which curvature becomes signifi cant, c(r) < 1/d, because the manifold is no longer perpendicular to the surface of the ball, so the ball does not have to grow as fast to accommodate new points. At the locally linear scale, the process peaks at c(r) = 1/d, because points are distributed only in the directions of the manifold’s local tangent space. The maximum of c(r) therefore gives an estimate of both the scale and the local dimensionality of the manifold (see fi gure 1), provided that the ball hasn’t expanded to a manifold boundary— boundaries have lower dimension than Scale behavior of a 1D manifold in 2-space Point−count growth process on a 2D manifold in 3−space 1 10 radial growth process 1D hypothesis 2D hypothesis 3D hypothesis radius (log scale) samples noise scale locally linear scale curvature scale 0 10 2 1 10 2 10 #points (log scale) 3 10 Figure 1: Point growth processes. L EFT: At the locally linear scale, the number of points in an r-ball grows as rd ; at noise and curvature scales it grows faster. R IGHT: Using the point-count growth process to fi the intrinsic dimensionality of a 2D manifold nonlinearly nd embedded in 3-space (see fi gure 2). Lines of slope 1/3 , 1/2 , and 1 are fi tted to sections of the log r/ log nr curve. For neighborhoods of radius r ≈ 1 with roughly n ≈ 10 points, the slope peaks at 1/2 indicating a dimensionality of d = 2. Below that, the data appears 3 D because it is dominated by noise (except for n ≤ D points); above, the data appears >2 D because of manifold curvature. As the r-ball expands to cover the entire data-set the dimensionality appears to drop to 1 as the process begins to track the 1D edges of the 2D sheet. the manifold. For low-dimensional manifolds such as sheets, the boundary submanifolds (edges and corners) are very small relative to the full manifold, so the boundary effect is typically limited to a small rise in c(r) as r approaches the scale of the entire data set. In practice, our code simply expands an r-ball at every point and looks for the fi peak in rst c(r), averaged over many nearby r-balls. One can estimate d and r globally or per-point. 3 Charting the data In the charting step we fi a soft partitioning of the data into locally linear low-dimensional nd neighborhoods, as a prelude to computing the connection that gives the global lowdimensional embedding. To minimize information loss in the connection, we require that the data points project into a subspace associated with each neighborhood with (1) minimal loss of local variance and (2) maximal agreement of the projections of nearby points into nearby neighborhoods. Criterion (1) is served by maximizing the likelihood function of a Gaussian mixture model (GMM) density fi tted to the data: . p(yi |µ, Σ) = ∑ j p(yi |µ j , Σ j ) p j = ∑ j N (yi ; µ j , Σ j ) p j . (1) Each gaussian component defi a local neighborhood centered around µ j with axes denes fi ned by the eigenvectors of Σ j . The amount of data variance along each axis is indicated by the eigenvalues of Σ j ; if the data manifold is locally linear in the vicinity of the µ j , all but the d dominant eigenvalues will be near-zero, implying that the associated eigenvectors constitute the optimal variance-preserving local coordinate system. To some degree likelihood maximization will naturally realize this property: It requires that the GMM components shrink in volume to fi the data as tightly as possible, which is best achieved by t positioning the components so that they “ pancake” onto locally flat collections of datapoints. However, this state of affairs is easily violated by degenerate (zero-variance) GMM components or components fi tted to overly small enough locales where the data density off the manifold is comparable to density on the manifold (e.g., at the noise scale). Consequently a prior is needed. Criterion (2) implies that neighboring partitions should have dominant axes that span similar subspaces, since disagreement (large subspace angles) would lead to inconsistent projections of a point and therefore uncertainty about its location in a low-dimensional coordinate space. The principal insight is that criterion (2) is exactly the cost of coding the location of a point in one neighborhood when it is generated by another neighborhood— the cross-entropy between the gaussian models defi ning the two neighborhoods: D(N1 N2 ) = = dy N (y; µ1 ,Σ1 ) log N (y; µ1 ,Σ1 ) N (y; µ2 ,Σ2 ) (log |Σ−1 Σ2 | + trace(Σ−1 Σ1 ) + (µ2 −µ1 ) Σ−1 (µ2 −µ1 ) − D)/2. (2) 1 2 2 Roughly speaking, the terms in (2) measure differences in size, orientation, and position, respectively, of two coordinate frames located at the means µ1 , µ2 with axes specifi by ed the eigenvectors of Σ1 , Σ2 . All three terms decline to zero as the overlap between the two frames is maximized. To maximize consistency between adjacent neighborhoods, we form . the prior p(µ, Σ) = exp[− ∑i= j mi (µ j )D(Ni N j )], where mi (µ j ) is a measure of co-locality. Unlike global coordination [8], we are not asking that the dominant axes in neighboring charts are aligned— only that they span nearly the same subspace. This is a much easier objective to satisfy, and it contains a useful special case where the posterior p(µ, Σ|Y) ∝ ∑i p(yi |µ, Σ)p(µ, Σ) is unimodal and can be maximized in closed form: Let us associate a gaussian neighborhood with each data-point, setting µi = yi ; take all neighborhoods to be a priori equally probable, setting pi = 1/N; and let the co-locality measure be determined from some local kernel. For example, in this paper we use mi (µ j ) ∝ N (µ j ; µi , σ2 ), with the scale parameter σ specifying the expected size of a neighborhood on the manifold in sample space. A reasonable choice is σ = r/2, so that 2erf(2) > 99.5% of the density of mi (µ j ) is contained in the area around yi where the manifold is expected to be locally linear. With uniform pi and µi , mi (µ j ) and fi xed, the MAP estimates of the GMM covariances are Σi = ∑ mi (µ j ) (y j − µi )(y j − µi ) + (µ j − µi )(µ j − µi ) + Σ j j ∑ mi (µ j ) (3) . j Note that each covariance Σi is dependent on all other Σ j . The MAP estimators for all covariances can be arranged into a set of fully constrained linear equations and solved exactly for their mutually optimal values. This key step brings nonlocal information about the manifold’s shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces. Even if a local subset of data points are dense in a direction perpendicular to the manifold, the prior encourages the local chart to orient parallel to the manifold as part of a globally optimal solution, protecting against a pathology noted in [8]. Equation (3) is easily adapted to give a reduced number of charts and/or charts centered on local centroids. 4 Connecting the charts We now build a connection for set of charts specifi as an arbitrary nondegenerate GMM. A ed GMM gives a soft partitioning of the dataset into neighborhoods of mean µk and covariance Σk . The optimal variance-preserving low-dimensional coordinate system for each neighborhood derives from its weighted principal component analysis, which is exactly specifi ed by the eigenvectors of its covariance matrix: Eigendecompose Vk Λk Vk ← Σk with eigen. values in descending order on the diagonal of Λk and let Wk = [Id , 0]Vk be the operator . th projecting points into the k local chart, such that local chart coordinate uki = Wk (yi − µk ) . and Uk = [uk1 , · · · , ukN ] holds the local coordinates of all points. Our goal is to sew together all charts into a globally consistent low-dimensional coordinate system. For each chart there will be a low-dimensional affi transform Gk ∈ R(d+1)×d ne that projects Uk into the global coordinate space. Summing over all charts, the weighted average of the projections of point yi into the low-dimensional vector space is W j (y − µ j ) 1 . x|y = ∑ G j j p j|y (y) . xi |yi = ∑ G j ⇒ u ji 1 j p j|y (yi ), (4) where pk|y (y) ∝ pk N (y; µk , Σk ), ∑k pk|y (y) = 1 is the probability that chart k generates point y. As pointed out in [8], if a point has nonzero probabilities in two charts, then there should be affi transforms of those two charts that map the point to the same place in a ne global coordinate space. We set this up as a weighted least-squares problem: . G = [G1 , · · · , GK ] = arg min uki 1 ∑ pk|y (yi )p j|y (yi ) Gk Gk ,G j i −Gj u ji 1 2 . (5) F Equation (5) generates a homogeneous set of equations that determines a solution up to an affi transform of G. There are two solution methods. First, let us temporarily anchor one ne neighborhood at the origin to fi this indeterminacy. This adds the constraint G1 = [I, 0] . x . To solve, defi indicator matrix Fk = [0, · · · , 0, I, 0, · · · , 0] with the identity mane . trix occupying the kth block, such that Gk = GFk . Let the diagonal of Pk = diag([pk|y (y1 ), · · · , pk|y (yN )]) record the per-point posteriors of chart k. The squared error of the connection is then a sum of of all patch-to-anchor and patch-to-patch inconsistencies: . E =∑ (GUk − k U1 0 2 )Pk P1 F + ∑ (GU j − GUk )P j Pk j=k 2 F ; . Uk = Fk Uk 1 . (6) Setting dE /dG = 0 and solving to minimize convex E gives −1 G = ∑ Uk P2 k k ∑ j=k P2 j Uk − ∑ ∑ Uk P2 P2 k 1 Uk P2 P2 U j k j k j=k U1 0 . (7) We now remove the dependence on a reference neighborhood G1 by rewriting equation 5, G = arg min ∑ j=k (GU j − GUk )P j Pk G 2 F = GQ 2 F = trace(GQQ G ) , (8) . where Q = ∑ j=k U j − Uk P j Pk . If we require that GG = I to prevent degenerate solutions, then equation (8) is solved (up to rotation in coordinate space) by setting G to the eigenvectors associated with the smallest eigenvalues of QQ . The eigenvectors can be computed effi ciently without explicitly forming QQ ; other numerical effi ciencies obtain by zeroing any vanishingly small probabilities in each Pk , yielding a sparse eigenproblem. A more interesting strategy is to numerically condition the problem by calculating the trailing eigenvectors of QQ + 1. It can be shown that this maximizes the posterior 2 p(G|Q) ∝ p(Q|G)p(G) ∝ e− GQ F e− G1 , where the prior p(G) favors a mapping G whose unit-norm rows are also zero-mean. This maximizes variance in each row of G and thereby spreads the projected points broadly and evenly over coordinate space. The solutions for MAP charts (equation (5)) and connection (equation (8)) can be applied to any well-fi tted mixture of gaussians/factors1 /PCAs density model; thus large eigenproblems can be avoided by connecting just a small number of charts that cover the data. 1 We thank reviewers for calling our attention to Teh & Roweis ([11]— in this volume), which shows how to connect a set of given local dimensionality reducers in a generalized eigenvalue problem that is related to equation (8). LLE, n=5 charting (projection onto coordinate space) charting best Isomap LLE, n=6 LLE, n=7 LLE, n=8 random subset of local charts XYZ view LLE, n=9 LLE, n=10 XZ view data (linked) embedding, XY view XY view original data reconstruction (back−projected coordinate grid) best LLE (regularized) Figure 2: The twisted curl problem. L EFT: Comparison of charting, I SO M AP, & LLE. 400 points are randomly sampled from the manifold with noise. Charting is the only method that recovers the original space without catastrophes (folding), albeit with some shear. R IGHT: The manifold is regularly sampled (with noise) to illustrate the forward and backward projections. Samples are shown linked into lines to help visualize the manifold structure. Coordinate axes of a random selection of charts are shown as bold lines. Connecting subsets of charts such as this will also give good mappings. The upper right quadrant shows various LLE results. At bottom we show the charting solution and the reconstructed (back-projected) manifold, which smooths out the noise. Once the connection is solved, equation (4) gives the forward projection of any point y down into coordinate space. There are several numerically distinct candidates for the backprojection: posterior mean, mode, or exact inverse. In general, there may not be a unique posterior mode and the exact inverse is not solvable in closed form (this is also true of [8]). Note that chart-wise projection defi a complementary density in coordinate space nes px|k (x) = N (x; Gk 0 1 , Gk [Id , 0]Λk [Id , 0] 0 0 0 Gk ). (9) Let p(y|x, k), used to map x into subspace k on the surface of the manifold, be a Dirac delta function whose mean is a linear function of x. Then the posterior mean back-projection is obtained by integrating out uncertainty over which chart generates x: y|x = ∑ pk|x (x) k µk + Wk Gk I 0 + x − Gk 0 1 , (10) where (·)+ denotes pseudo-inverse. In general, a back-projecting map should not reconstruct the original points. Instead, equation (10) generates a surface that passes through the weighted average of the µi of all the neighborhoods in which yi has nonzero probability, much like a principal curve passes through the center of each local group of points. 5 Experiments Synthetic examples: 400 2 D points were randomly sampled from a 2 D square and embedded in 3 D via a curl and twist, then contaminated with gaussian noise. Even if noiselessly sampled, this manifold cannot be “ unrolled” without distortion. In addition, the outer curl is sampled much less densely than the inner curl. With an order of magnitude fewer points, higher noise levels, no possibility of an isometric mapping, and uneven sampling, this is arguably a much more challenging problem than the “ swiss roll” and “ s-curve” problems featured in [12, 9, 8, 1]. Figure 2LEFT contrasts the (unique) output of charting and the best outputs obtained from I SO M AP and LLE (considering all neighborhood sizes between 2 and 20 points). I SO M AP and LLE show catastrophic folding; we had to change LLE’s b. data, yz view c. local charts d. 2D embedding e. 1D embedding 1D ordinate a. data, xy view true manifold arc length Figure 3: Untying a trefoil knot ( ) by charting. 900 noisy samples from a 3 D-embedded 1 D manifold are shown as connected dots in front (a) and side (b) views. A subset of charts is shown in (c). Solving for the 2 D connection gives the “ unknot” in (d). After removing some points to cut the knot, charting gives a 1 D embedding which we plot against true manifold arc length in (e); monotonicity (modulo noise) indicates correctness. Three principal degrees of freedom recovered from raw jittered images pose scale expression images synthesized via backprojection of straight lines in coordinate space Figure 4: Modeling the manifold of facial images from raw video. Each row contains images synthesized by back-projecting an axis-parallel straight line in coordinate space onto the manifold in image space. Blurry images correspond to points on the manifold whose neighborhoods contain few if any nearby data points. regularization in order to coax out nondegenerate (>1 D) solutions. Although charting is not designed for isometry, after affi transform the forward-projected points disagree with ne the original points with an RMS error of only 1.0429, lower than the best LLE (3.1423) or best I SO M AP (1.1424, not shown). Figure 2RIGHT shows the same problem where points are sampled regularly from a grid, with noise added before and after embedding. Figure 3 shows a similar treatment of a 1 D line that was threaded into a 3 D trefoil knot, contaminated with gaussian noise, and then “ untied” via charting. Video: We obtained a 1965-frame video sequence (courtesy S. Roweis and B. Frey) of 20 × 28-pixel images in which B.F. strikes a variety of poses and expressions. The video is heavily contaminated with synthetic camera jitters. We used raw images, though image processing could have removed this and other uninteresting sources of variation. We took a 500-frame subsequence and left-right mirrored it to obtain 1000 points in 20 × 28 = 560D image space. The point-growth process peaked just above d = 3 dimensions. We solved for 25 charts, each centered on a random point, and a 3D connection. The recovered degrees of freedom— recognizable as pose, scale, and expression— are visualized in fi gure 4. original data stereographic map to 3D fishbowl charting Figure 5: Flattening a fi shbowl. From the left: Original 2000×2D points; their stereographic mapping to a 3D fi shbowl; its 2D embedding recovered using 500 charts; and the stereographic map. Fewer charts lead to isometric mappings that fold the bowl (not shown). Conformality: Some manifolds can be flattened conformally (preserving local angles) but not isometrically. Figure 5 shows that if the data is fi nely charted, the connection behaves more conformally than isometrically. This problem was suggested by J. Tenenbaum. 6 Discussion Charting breaks kernel-based NLDR into two subproblems: (1) Finding a set of datacovering locally linear neighborhoods (“ charts” ) such that adjoining neighborhoods span maximally similar subspaces, and (2) computing a minimal-distortion merger (“ connection” ) of all charts. The solution to (1) is optimal w.r.t. the estimated scale of local linearity r; the solution to (2) is optimal w.r.t. the solution to (1) and the desired dimensionality d. Both problems have Bayesian settings. By offloading the nonlinearity onto the kernels, we obtain least-squares problems and closed form solutions. This scheme is also attractive because large eigenproblems can be avoided by using a reduced set of charts. The dependence on r, like trusted-set methods, is a potential source of solution instability. In practice the point-growth estimate seems fairly robust to data perturbations (to be expected if the data density changes slowly over a manifold of integral Hausdorff dimension), while the use of a soft neighborhood partitioning appears to make charting solutions reasonably stable to variations in r. Eigenvalue stability analyses may prove useful here. Ultimately, we would prefer to integrate r out. In contrast, use of d appears to be a virtue: Unlike other eigenvector-based methods, the best d-dimensional embedding is not merely a linear projection of the best d + 1-dimensional embedding; a unique distortion is found for each value of d that maximizes the information content of its embedding. Why does charting performs well on datasets where the signal-to-noise ratio confounds recent state-of-the-art methods? Two reasons may be adduced: (1) Nonlocal information is used to construct both the system of local charts and their global connection. (2) The mapping only preserves the component of local point-to-point distances that project onto the manifold; relationships perpendicular to the manifold are discarded. Thus charting uses global shape information to suppress noise in the constraints that determine the mapping. Acknowledgments Thanks to J. Buhmann, S. Makar, S. Roweis, J. Tenenbaum, and anonymous reviewers for insightful comments and suggested “ challenge” problems. References [1] M. Balasubramanian and E. L. Schwartz. The IsoMap algorithm and topological stability. Science, 295(5552):7, January 2002. [2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS–7, 1995. [3] D. DeMers and G. Cottrell. Nonlinear dimensionality reduction. In NIPS–5, 1993. [4] J. Gomes and A. Mojsilovic. A variational approach to recovering a manifold from sample points. In ECCV, 2002. [5] T. Hastie and W. Stuetzle. Principal curves. J. Am. Statistical Assoc, 84(406):502–516, 1989. [6] G. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of handwritten digits. IEEE Trans. Neural Networks, 8, 1997. [7] N. Kambhatla and T. Leen. Dimensionality reduction by local principal component analysis. Neural Computation, 9, 1997. [8] S. Roweis, L. Saul, and G. Hinton. Global coordination of linear models. In NIPS–13, 2002. [9] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. [10] A. Smola, S. Mika, B. Schölkopf, and R. Williamson. Regularized principal manifolds. Machine Learning, 1999. [11] Y. W. Teh and S. T. Roweis. Automatic alignment of hidden representations. In NIPS–15, 2003. [12] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000.
3 0.11025158 36 nips-2002-Automatic Alignment of Local Representations
Author: Yee W. Teh, Sam T. Roweis
Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules. 2 The Locally Linear Coordination Algorithm H 9¥ EI¡ CD66B9 ©9B 766 % G F 5 #
4 0.090322383 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
5 0.081702009 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
6 0.07791768 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
7 0.075871944 113 nips-2002-Information Diffusion Kernels
8 0.073815465 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction
9 0.071942128 124 nips-2002-Learning Graphical Models with Mercer Kernels
10 0.0718152 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
11 0.07072524 156 nips-2002-On the Complexity of Learning the Kernel Matrix
12 0.067209259 111 nips-2002-Independent Components Analysis through Product Density Estimation
13 0.066618957 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
14 0.065345131 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
15 0.063997149 119 nips-2002-Kernel Dependency Estimation
16 0.062393457 181 nips-2002-Self Supervised Boosting
17 0.058807999 120 nips-2002-Kernel Design Using Boosting
18 0.058782592 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
19 0.058154281 115 nips-2002-Informed Projections
20 0.056149244 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
topicId topicWeight
[(0, -0.188), (1, -0.079), (2, 0.039), (3, 0.015), (4, -0.013), (5, 0.001), (6, -0.021), (7, 0.017), (8, -0.026), (9, 0.161), (10, 0.04), (11, 0.017), (12, 0.058), (13, -0.066), (14, 0.102), (15, 0.151), (16, -0.171), (17, 0.075), (18, 0.061), (19, -0.07), (20, 0.029), (21, 0.079), (22, 0.024), (23, -0.064), (24, -0.025), (25, -0.027), (26, -0.076), (27, -0.005), (28, -0.048), (29, -0.014), (30, -0.015), (31, -0.043), (32, -0.032), (33, -0.091), (34, 0.032), (35, 0.008), (36, 0.015), (37, -0.018), (38, -0.049), (39, -0.081), (40, 0.081), (41, 0.02), (42, 0.002), (43, -0.027), (44, 0.016), (45, 0.07), (46, 0.023), (47, -0.05), (48, 0.076), (49, -0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.93378383 138 nips-2002-Manifold Parzen Windows
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
2 0.84421599 49 nips-2002-Charting a Manifold
Author: Matthew Brand
Abstract: We construct a nonlinear mapping from a high-dimensional sample space to a low-dimensional vector space, effectively recovering a Cartesian coordinate system for the manifold from which the data is sampled. The mapping preserves local geometric relations in the manifold and is pseudo-invertible. We show how to estimate the intrinsic dimensionality of the manifold from samples, decompose the sample data into locally linear low-dimensional patches, merge these patches into a single lowdimensional coordinate system, and compute forward and reverse mappings between the sample and coordinate spaces. The objective functions are convex and their solutions are given in closed form. 1 Nonlinear dimensionality reduction (NLDR) by charting Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space. It is presumed that the data lies on or near a lowdimensional manifold embedded in the sample space, and that there exists a 1-to-1 smooth nonlinear transform between the manifold and a low-dimensional vector space. The datamodeler’s goal is to estimate smooth continuous mappings between the sample and coordinate spaces. Often this analysis will shed light on the intrinsic variables of the datagenerating phenomenon, for example, revealing perceptual or configuration spaces. Our goal is to find a mapping—expressed as a kernel-based mixture of linear projections— that minimizes information loss about the density and relative locations of sample points. This constraint is expressed in a posterior that combines a standard gaussian mixture model (GMM) likelihood function with a prior that penalizes uncertainty due to inconsistent projections in the mixture. Section 3 develops a special case where this posterior is unimodal and maximizable in closed form, yielding a GMM whose covariances reveal a patchwork of overlapping locally linear subspaces that cover the manifold. Section 4 shows that for this (or any) GMM and a choice of reduced dimension d, there is a unique, closed-form solution for a minimally distorting merger of the subspaces into a d-dimensional coordinate space, as well as an reverse mapping defining the surface of the manifold in the sample space. The intrinsic dimensionality d of the data manifold can be estimated from the growth process of point-to-point distances. In analogy to differential geometry, we call the subspaces “charts” and their merger the “connection.” Section 5 considers example problems where these methods are used to untie knots, unroll and untwist sheets, and visualize video data. 1.1 Background Topology-neutral NLDR algorithms can be divided into those that compute mappings, and those that directly compute low-dimensional embeddings. The fi has its roots in mapeld ping algorithms: DeMers and Cottrell [3] proposed using auto-encoding neural networks with a hidden layer “ bottleneck,” effectively casting dimensionality reduction as a compression problem. Hastie defi principal curves [5] as nonparametric 1 D curves that pass ned through the center of “ nearby” data points. A rich literature has grown up around properly regularizing this approach and extending it to surfaces. Smola and colleagues [10] analyzed the NLDR problem in the broader framework of regularized quantization methods. More recent advances aim for embeddings: Gomes and Mojsilovic [4] treat manifold completion as an anisotropic diffusion problem, iteratively expanding points until they connect to their neighbors. The I SO M AP algorithm [12] represents remote distances as sums of a trusted set of distances between immediate neighbors, then uses multidimensional scaling to compute a low-dimensional embedding that minimally distorts all distances. The locally linear embedding algorithm (LLE) [9] represents each point as a weighted combination of a trusted set of nearest neighbors, then computes a minimally distorting low-dimensional barycentric embedding. They have complementary strengths: I SO M AP handles holes well but can fail if the data hull is nonconvex [12]; and vice versa for LLE [9]. Both offer embeddings without mappings. It has been noted that trusted-set methods are vulnerable to noise because they consider the subset of point-to-point relationships that has the lowest signal-to-noise ratio; small changes to the trusted set can induce large changes in the set of constraints on the embedding, making solutions unstable [1]. In a return to mapping, Roweis and colleagues [8] proposed global coordination— learning a mixture of locally linear projections from sample to coordinate space. They constructed a posterior that penalizes distortions in the mapping, and gave a expectation-maximization (EM) training rule. Innovative use of variational methods highlighted the diffi culty of even hill-climbing their multimodal posterior. Like [2, 7, 6, 8], the method we develop below is a decomposition of the manifold into locally linear neighborhoods. It bears closest relation to global coordination [8], although by a different construction of the problem, we avoid hill-climbing a spiky posterior and instead develop a closed-form solution. 2 Estimating locally linear scale and intrinsic dimensionality . We begin with matrix of sample points Y = [y1 , · · · , yN ], yn ∈ RD populating a Ddimensional sample space, and a conjecture that these points are samples from a manifold M of intrinsic dimensionality d < D. We seek a mapping onto a vector space . G(Y) → X = [x1 , · · · , xN ], xn ∈ Rd and 1-to-1 reverse mapping G−1 (X) → Y such that local relations between nearby points are preserved (this will be formalized below). The map G should be non-catastrophic, that is, without folds: Parallel lines on the manifold in RD should map to continuous smooth non-intersecting curves in Rd . This guarantees that linear operations on X such as interpolation will have reasonable analogues on Y. Smoothness means that at some scale r the mapping from a neighborhood on M to Rd is effectively linear. Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as rd , but only at the locally linear scale; the grow rate is inflated by isotropic noise at smaller scales and by embedding curvature at larger scales. . To estimate r, we look at how the r-ball grows as points are added to it, tracking c(r) = d d log n(r) log r. At noise scales, c(r) ≈ 1/D < 1/d, because noise has distributed points in all directions with equal probability. At the scale at which curvature becomes signifi cant, c(r) < 1/d, because the manifold is no longer perpendicular to the surface of the ball, so the ball does not have to grow as fast to accommodate new points. At the locally linear scale, the process peaks at c(r) = 1/d, because points are distributed only in the directions of the manifold’s local tangent space. The maximum of c(r) therefore gives an estimate of both the scale and the local dimensionality of the manifold (see fi gure 1), provided that the ball hasn’t expanded to a manifold boundary— boundaries have lower dimension than Scale behavior of a 1D manifold in 2-space Point−count growth process on a 2D manifold in 3−space 1 10 radial growth process 1D hypothesis 2D hypothesis 3D hypothesis radius (log scale) samples noise scale locally linear scale curvature scale 0 10 2 1 10 2 10 #points (log scale) 3 10 Figure 1: Point growth processes. L EFT: At the locally linear scale, the number of points in an r-ball grows as rd ; at noise and curvature scales it grows faster. R IGHT: Using the point-count growth process to fi the intrinsic dimensionality of a 2D manifold nonlinearly nd embedded in 3-space (see fi gure 2). Lines of slope 1/3 , 1/2 , and 1 are fi tted to sections of the log r/ log nr curve. For neighborhoods of radius r ≈ 1 with roughly n ≈ 10 points, the slope peaks at 1/2 indicating a dimensionality of d = 2. Below that, the data appears 3 D because it is dominated by noise (except for n ≤ D points); above, the data appears >2 D because of manifold curvature. As the r-ball expands to cover the entire data-set the dimensionality appears to drop to 1 as the process begins to track the 1D edges of the 2D sheet. the manifold. For low-dimensional manifolds such as sheets, the boundary submanifolds (edges and corners) are very small relative to the full manifold, so the boundary effect is typically limited to a small rise in c(r) as r approaches the scale of the entire data set. In practice, our code simply expands an r-ball at every point and looks for the fi peak in rst c(r), averaged over many nearby r-balls. One can estimate d and r globally or per-point. 3 Charting the data In the charting step we fi a soft partitioning of the data into locally linear low-dimensional nd neighborhoods, as a prelude to computing the connection that gives the global lowdimensional embedding. To minimize information loss in the connection, we require that the data points project into a subspace associated with each neighborhood with (1) minimal loss of local variance and (2) maximal agreement of the projections of nearby points into nearby neighborhoods. Criterion (1) is served by maximizing the likelihood function of a Gaussian mixture model (GMM) density fi tted to the data: . p(yi |µ, Σ) = ∑ j p(yi |µ j , Σ j ) p j = ∑ j N (yi ; µ j , Σ j ) p j . (1) Each gaussian component defi a local neighborhood centered around µ j with axes denes fi ned by the eigenvectors of Σ j . The amount of data variance along each axis is indicated by the eigenvalues of Σ j ; if the data manifold is locally linear in the vicinity of the µ j , all but the d dominant eigenvalues will be near-zero, implying that the associated eigenvectors constitute the optimal variance-preserving local coordinate system. To some degree likelihood maximization will naturally realize this property: It requires that the GMM components shrink in volume to fi the data as tightly as possible, which is best achieved by t positioning the components so that they “ pancake” onto locally flat collections of datapoints. However, this state of affairs is easily violated by degenerate (zero-variance) GMM components or components fi tted to overly small enough locales where the data density off the manifold is comparable to density on the manifold (e.g., at the noise scale). Consequently a prior is needed. Criterion (2) implies that neighboring partitions should have dominant axes that span similar subspaces, since disagreement (large subspace angles) would lead to inconsistent projections of a point and therefore uncertainty about its location in a low-dimensional coordinate space. The principal insight is that criterion (2) is exactly the cost of coding the location of a point in one neighborhood when it is generated by another neighborhood— the cross-entropy between the gaussian models defi ning the two neighborhoods: D(N1 N2 ) = = dy N (y; µ1 ,Σ1 ) log N (y; µ1 ,Σ1 ) N (y; µ2 ,Σ2 ) (log |Σ−1 Σ2 | + trace(Σ−1 Σ1 ) + (µ2 −µ1 ) Σ−1 (µ2 −µ1 ) − D)/2. (2) 1 2 2 Roughly speaking, the terms in (2) measure differences in size, orientation, and position, respectively, of two coordinate frames located at the means µ1 , µ2 with axes specifi by ed the eigenvectors of Σ1 , Σ2 . All three terms decline to zero as the overlap between the two frames is maximized. To maximize consistency between adjacent neighborhoods, we form . the prior p(µ, Σ) = exp[− ∑i= j mi (µ j )D(Ni N j )], where mi (µ j ) is a measure of co-locality. Unlike global coordination [8], we are not asking that the dominant axes in neighboring charts are aligned— only that they span nearly the same subspace. This is a much easier objective to satisfy, and it contains a useful special case where the posterior p(µ, Σ|Y) ∝ ∑i p(yi |µ, Σ)p(µ, Σ) is unimodal and can be maximized in closed form: Let us associate a gaussian neighborhood with each data-point, setting µi = yi ; take all neighborhoods to be a priori equally probable, setting pi = 1/N; and let the co-locality measure be determined from some local kernel. For example, in this paper we use mi (µ j ) ∝ N (µ j ; µi , σ2 ), with the scale parameter σ specifying the expected size of a neighborhood on the manifold in sample space. A reasonable choice is σ = r/2, so that 2erf(2) > 99.5% of the density of mi (µ j ) is contained in the area around yi where the manifold is expected to be locally linear. With uniform pi and µi , mi (µ j ) and fi xed, the MAP estimates of the GMM covariances are Σi = ∑ mi (µ j ) (y j − µi )(y j − µi ) + (µ j − µi )(µ j − µi ) + Σ j j ∑ mi (µ j ) (3) . j Note that each covariance Σi is dependent on all other Σ j . The MAP estimators for all covariances can be arranged into a set of fully constrained linear equations and solved exactly for their mutually optimal values. This key step brings nonlocal information about the manifold’s shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces. Even if a local subset of data points are dense in a direction perpendicular to the manifold, the prior encourages the local chart to orient parallel to the manifold as part of a globally optimal solution, protecting against a pathology noted in [8]. Equation (3) is easily adapted to give a reduced number of charts and/or charts centered on local centroids. 4 Connecting the charts We now build a connection for set of charts specifi as an arbitrary nondegenerate GMM. A ed GMM gives a soft partitioning of the dataset into neighborhoods of mean µk and covariance Σk . The optimal variance-preserving low-dimensional coordinate system for each neighborhood derives from its weighted principal component analysis, which is exactly specifi ed by the eigenvectors of its covariance matrix: Eigendecompose Vk Λk Vk ← Σk with eigen. values in descending order on the diagonal of Λk and let Wk = [Id , 0]Vk be the operator . th projecting points into the k local chart, such that local chart coordinate uki = Wk (yi − µk ) . and Uk = [uk1 , · · · , ukN ] holds the local coordinates of all points. Our goal is to sew together all charts into a globally consistent low-dimensional coordinate system. For each chart there will be a low-dimensional affi transform Gk ∈ R(d+1)×d ne that projects Uk into the global coordinate space. Summing over all charts, the weighted average of the projections of point yi into the low-dimensional vector space is W j (y − µ j ) 1 . x|y = ∑ G j j p j|y (y) . xi |yi = ∑ G j ⇒ u ji 1 j p j|y (yi ), (4) where pk|y (y) ∝ pk N (y; µk , Σk ), ∑k pk|y (y) = 1 is the probability that chart k generates point y. As pointed out in [8], if a point has nonzero probabilities in two charts, then there should be affi transforms of those two charts that map the point to the same place in a ne global coordinate space. We set this up as a weighted least-squares problem: . G = [G1 , · · · , GK ] = arg min uki 1 ∑ pk|y (yi )p j|y (yi ) Gk Gk ,G j i −Gj u ji 1 2 . (5) F Equation (5) generates a homogeneous set of equations that determines a solution up to an affi transform of G. There are two solution methods. First, let us temporarily anchor one ne neighborhood at the origin to fi this indeterminacy. This adds the constraint G1 = [I, 0] . x . To solve, defi indicator matrix Fk = [0, · · · , 0, I, 0, · · · , 0] with the identity mane . trix occupying the kth block, such that Gk = GFk . Let the diagonal of Pk = diag([pk|y (y1 ), · · · , pk|y (yN )]) record the per-point posteriors of chart k. The squared error of the connection is then a sum of of all patch-to-anchor and patch-to-patch inconsistencies: . E =∑ (GUk − k U1 0 2 )Pk P1 F + ∑ (GU j − GUk )P j Pk j=k 2 F ; . Uk = Fk Uk 1 . (6) Setting dE /dG = 0 and solving to minimize convex E gives −1 G = ∑ Uk P2 k k ∑ j=k P2 j Uk − ∑ ∑ Uk P2 P2 k 1 Uk P2 P2 U j k j k j=k U1 0 . (7) We now remove the dependence on a reference neighborhood G1 by rewriting equation 5, G = arg min ∑ j=k (GU j − GUk )P j Pk G 2 F = GQ 2 F = trace(GQQ G ) , (8) . where Q = ∑ j=k U j − Uk P j Pk . If we require that GG = I to prevent degenerate solutions, then equation (8) is solved (up to rotation in coordinate space) by setting G to the eigenvectors associated with the smallest eigenvalues of QQ . The eigenvectors can be computed effi ciently without explicitly forming QQ ; other numerical effi ciencies obtain by zeroing any vanishingly small probabilities in each Pk , yielding a sparse eigenproblem. A more interesting strategy is to numerically condition the problem by calculating the trailing eigenvectors of QQ + 1. It can be shown that this maximizes the posterior 2 p(G|Q) ∝ p(Q|G)p(G) ∝ e− GQ F e− G1 , where the prior p(G) favors a mapping G whose unit-norm rows are also zero-mean. This maximizes variance in each row of G and thereby spreads the projected points broadly and evenly over coordinate space. The solutions for MAP charts (equation (5)) and connection (equation (8)) can be applied to any well-fi tted mixture of gaussians/factors1 /PCAs density model; thus large eigenproblems can be avoided by connecting just a small number of charts that cover the data. 1 We thank reviewers for calling our attention to Teh & Roweis ([11]— in this volume), which shows how to connect a set of given local dimensionality reducers in a generalized eigenvalue problem that is related to equation (8). LLE, n=5 charting (projection onto coordinate space) charting best Isomap LLE, n=6 LLE, n=7 LLE, n=8 random subset of local charts XYZ view LLE, n=9 LLE, n=10 XZ view data (linked) embedding, XY view XY view original data reconstruction (back−projected coordinate grid) best LLE (regularized) Figure 2: The twisted curl problem. L EFT: Comparison of charting, I SO M AP, & LLE. 400 points are randomly sampled from the manifold with noise. Charting is the only method that recovers the original space without catastrophes (folding), albeit with some shear. R IGHT: The manifold is regularly sampled (with noise) to illustrate the forward and backward projections. Samples are shown linked into lines to help visualize the manifold structure. Coordinate axes of a random selection of charts are shown as bold lines. Connecting subsets of charts such as this will also give good mappings. The upper right quadrant shows various LLE results. At bottom we show the charting solution and the reconstructed (back-projected) manifold, which smooths out the noise. Once the connection is solved, equation (4) gives the forward projection of any point y down into coordinate space. There are several numerically distinct candidates for the backprojection: posterior mean, mode, or exact inverse. In general, there may not be a unique posterior mode and the exact inverse is not solvable in closed form (this is also true of [8]). Note that chart-wise projection defi a complementary density in coordinate space nes px|k (x) = N (x; Gk 0 1 , Gk [Id , 0]Λk [Id , 0] 0 0 0 Gk ). (9) Let p(y|x, k), used to map x into subspace k on the surface of the manifold, be a Dirac delta function whose mean is a linear function of x. Then the posterior mean back-projection is obtained by integrating out uncertainty over which chart generates x: y|x = ∑ pk|x (x) k µk + Wk Gk I 0 + x − Gk 0 1 , (10) where (·)+ denotes pseudo-inverse. In general, a back-projecting map should not reconstruct the original points. Instead, equation (10) generates a surface that passes through the weighted average of the µi of all the neighborhoods in which yi has nonzero probability, much like a principal curve passes through the center of each local group of points. 5 Experiments Synthetic examples: 400 2 D points were randomly sampled from a 2 D square and embedded in 3 D via a curl and twist, then contaminated with gaussian noise. Even if noiselessly sampled, this manifold cannot be “ unrolled” without distortion. In addition, the outer curl is sampled much less densely than the inner curl. With an order of magnitude fewer points, higher noise levels, no possibility of an isometric mapping, and uneven sampling, this is arguably a much more challenging problem than the “ swiss roll” and “ s-curve” problems featured in [12, 9, 8, 1]. Figure 2LEFT contrasts the (unique) output of charting and the best outputs obtained from I SO M AP and LLE (considering all neighborhood sizes between 2 and 20 points). I SO M AP and LLE show catastrophic folding; we had to change LLE’s b. data, yz view c. local charts d. 2D embedding e. 1D embedding 1D ordinate a. data, xy view true manifold arc length Figure 3: Untying a trefoil knot ( ) by charting. 900 noisy samples from a 3 D-embedded 1 D manifold are shown as connected dots in front (a) and side (b) views. A subset of charts is shown in (c). Solving for the 2 D connection gives the “ unknot” in (d). After removing some points to cut the knot, charting gives a 1 D embedding which we plot against true manifold arc length in (e); monotonicity (modulo noise) indicates correctness. Three principal degrees of freedom recovered from raw jittered images pose scale expression images synthesized via backprojection of straight lines in coordinate space Figure 4: Modeling the manifold of facial images from raw video. Each row contains images synthesized by back-projecting an axis-parallel straight line in coordinate space onto the manifold in image space. Blurry images correspond to points on the manifold whose neighborhoods contain few if any nearby data points. regularization in order to coax out nondegenerate (>1 D) solutions. Although charting is not designed for isometry, after affi transform the forward-projected points disagree with ne the original points with an RMS error of only 1.0429, lower than the best LLE (3.1423) or best I SO M AP (1.1424, not shown). Figure 2RIGHT shows the same problem where points are sampled regularly from a grid, with noise added before and after embedding. Figure 3 shows a similar treatment of a 1 D line that was threaded into a 3 D trefoil knot, contaminated with gaussian noise, and then “ untied” via charting. Video: We obtained a 1965-frame video sequence (courtesy S. Roweis and B. Frey) of 20 × 28-pixel images in which B.F. strikes a variety of poses and expressions. The video is heavily contaminated with synthetic camera jitters. We used raw images, though image processing could have removed this and other uninteresting sources of variation. We took a 500-frame subsequence and left-right mirrored it to obtain 1000 points in 20 × 28 = 560D image space. The point-growth process peaked just above d = 3 dimensions. We solved for 25 charts, each centered on a random point, and a 3D connection. The recovered degrees of freedom— recognizable as pose, scale, and expression— are visualized in fi gure 4. original data stereographic map to 3D fishbowl charting Figure 5: Flattening a fi shbowl. From the left: Original 2000×2D points; their stereographic mapping to a 3D fi shbowl; its 2D embedding recovered using 500 charts; and the stereographic map. Fewer charts lead to isometric mappings that fold the bowl (not shown). Conformality: Some manifolds can be flattened conformally (preserving local angles) but not isometrically. Figure 5 shows that if the data is fi nely charted, the connection behaves more conformally than isometrically. This problem was suggested by J. Tenenbaum. 6 Discussion Charting breaks kernel-based NLDR into two subproblems: (1) Finding a set of datacovering locally linear neighborhoods (“ charts” ) such that adjoining neighborhoods span maximally similar subspaces, and (2) computing a minimal-distortion merger (“ connection” ) of all charts. The solution to (1) is optimal w.r.t. the estimated scale of local linearity r; the solution to (2) is optimal w.r.t. the solution to (1) and the desired dimensionality d. Both problems have Bayesian settings. By offloading the nonlinearity onto the kernels, we obtain least-squares problems and closed form solutions. This scheme is also attractive because large eigenproblems can be avoided by using a reduced set of charts. The dependence on r, like trusted-set methods, is a potential source of solution instability. In practice the point-growth estimate seems fairly robust to data perturbations (to be expected if the data density changes slowly over a manifold of integral Hausdorff dimension), while the use of a soft neighborhood partitioning appears to make charting solutions reasonably stable to variations in r. Eigenvalue stability analyses may prove useful here. Ultimately, we would prefer to integrate r out. In contrast, use of d appears to be a virtue: Unlike other eigenvector-based methods, the best d-dimensional embedding is not merely a linear projection of the best d + 1-dimensional embedding; a unique distortion is found for each value of d that maximizes the information content of its embedding. Why does charting performs well on datasets where the signal-to-noise ratio confounds recent state-of-the-art methods? Two reasons may be adduced: (1) Nonlocal information is used to construct both the system of local charts and their global connection. (2) The mapping only preserves the component of local point-to-point distances that project onto the manifold; relationships perpendicular to the manifold are discarded. Thus charting uses global shape information to suppress noise in the constraints that determine the mapping. Acknowledgments Thanks to J. Buhmann, S. Makar, S. Roweis, J. Tenenbaum, and anonymous reviewers for insightful comments and suggested “ challenge” problems. References [1] M. Balasubramanian and E. L. Schwartz. The IsoMap algorithm and topological stability. Science, 295(5552):7, January 2002. [2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS–7, 1995. [3] D. DeMers and G. Cottrell. Nonlinear dimensionality reduction. In NIPS–5, 1993. [4] J. Gomes and A. Mojsilovic. A variational approach to recovering a manifold from sample points. In ECCV, 2002. [5] T. Hastie and W. Stuetzle. Principal curves. J. Am. Statistical Assoc, 84(406):502–516, 1989. [6] G. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of handwritten digits. IEEE Trans. Neural Networks, 8, 1997. [7] N. Kambhatla and T. Leen. Dimensionality reduction by local principal component analysis. Neural Computation, 9, 1997. [8] S. Roweis, L. Saul, and G. Hinton. Global coordination of linear models. In NIPS–13, 2002. [9] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. [10] A. Smola, S. Mika, B. Schölkopf, and R. Williamson. Regularized principal manifolds. Machine Learning, 1999. [11] Y. W. Teh and S. T. Roweis. Automatic alignment of hidden representations. In NIPS–15, 2003. [12] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000.
3 0.77509892 36 nips-2002-Automatic Alignment of Local Representations
Author: Yee W. Teh, Sam T. Roweis
Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules. 2 The Locally Linear Coordination Algorithm H 9¥ EI¡ CD66B9 ©9B 766 % G F 5 #
4 0.6809305 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction
Author: Vin D. Silva, Joshua B. Tenenbaum
Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.
5 0.54459941 190 nips-2002-Stochastic Neighbor Embedding
Author: Geoffrey E. Hinton, Sam T. Roweis
Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.
6 0.53916401 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
7 0.50037706 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
8 0.49458241 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
9 0.4527038 111 nips-2002-Independent Components Analysis through Product Density Estimation
10 0.45197663 114 nips-2002-Information Regularization with Partially Labeled Data
11 0.44201502 115 nips-2002-Informed Projections
12 0.43227139 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
13 0.42566803 178 nips-2002-Robust Novelty Detection with Single-Class MPM
14 0.40990135 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
15 0.38569659 87 nips-2002-Fast Transformation-Invariant Factor Analysis
16 0.38251117 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
17 0.38137317 18 nips-2002-Adaptation and Unsupervised Learning
18 0.37917 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
19 0.37523541 113 nips-2002-Information Diffusion Kernels
20 0.37487 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
topicId topicWeight
[(11, 0.013), (14, 0.011), (23, 0.014), (42, 0.399), (54, 0.176), (55, 0.022), (57, 0.012), (67, 0.015), (68, 0.043), (74, 0.064), (92, 0.03), (98, 0.109)]
simIndex simValue paperId paperTitle
1 0.99443948 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
Author: Herbert Jaeger
Abstract: Echo state networks (ESN) are a novel approach to recurrent neural network training. An ESN consists of a large, fixed, recurrent
2 0.99364835 115 nips-2002-Informed Projections
Author: David Tax
Abstract: Low rank approximation techniques are widespread in pattern recognition research — they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysus (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally “unsupervised,” which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets. 1
3 0.98572999 181 nips-2002-Self Supervised Boosting
Author: Max Welling, Richard S. Zemel, Geoffrey E. Hinton
Abstract: Boosting algorithms and successful applications thereof abound for classification and regression learning problems, but not for unsupervised learning. We propose a sequential approach to adding features to a random field model by training them to improve classification performance between the data and an equal-sized sample of “negative examples” generated from the model’s current estimate of the data density. Training in each boosting round proceeds in three stages: first we sample negative examples from the model’s current Boltzmann distribution. Next, a feature is trained to improve classification performance between data and negative examples. Finally, a coefficient is learned which determines the importance of this feature relative to ones already in the pool. Negative examples only need to be generated once to learn each new feature. The validity of the approach is demonstrated on binary digits and continuous synthetic data.
4 0.98209929 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
Author: Christopher Williams, John S. Shawe-taylor
Abstract: In this paper we analyze the relationships between the eigenvalues of the m x m Gram matrix K for a kernel k(·, .) corresponding to a sample Xl, ... ,X m drawn from a density p(x) and the eigenvalues of the corresponding continuous eigenproblem. We bound the differences between the two spectra and provide a performance bound on kernel peA. 1
same-paper 5 0.93617904 138 nips-2002-Manifold Parzen Windows
Author: Pascal Vincent, Yoshua Bengio
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
6 0.80546808 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
7 0.80509549 46 nips-2002-Boosting Density Estimation
8 0.78362972 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
9 0.780734 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
10 0.77550656 169 nips-2002-Real-Time Particle Filters
11 0.76242632 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.75879294 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
13 0.75578356 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
14 0.75469422 3 nips-2002-A Convergent Form of Approximate Policy Iteration
15 0.74786329 190 nips-2002-Stochastic Neighbor Embedding
16 0.74457139 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
17 0.73652256 137 nips-2002-Location Estimation with a Differential Update Network
18 0.73619676 96 nips-2002-Generalized² Linear² Models
19 0.72859043 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
20 0.72747779 111 nips-2002-Independent Components Analysis through Product Density Estimation