nips nips2002 nips2002-117 knowledge-graph by maker-knowledge-mining

117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers


Source: pdf

Author: Balázs Kégl

Abstract: We propose a new algorithm to estimate the intrinsic dimension of data sets. The method is based on geometric properties of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. The method is compared to a similar, widely-used algorithm from the same family of geometric techniques. Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose a new algorithm to estimate the intrinsic dimension of data sets. [sent-4, score-0.38]

2 The method is based on geometric properties of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. [sent-5, score-0.106]

3 The method is compared to a similar, widely-used algorithm from the same family of geometric techniques. [sent-6, score-0.064]

4 Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise. [sent-7, score-0.064]

5 The phenomenon that the computational and statistical efficiency of statistical techniques degrades rapidly with the dimension is often referred to as the “curse of dimensionality”. [sent-9, score-0.175]

6 One particular characteristic of high-dimensional spaces is that as the volumes of constant diameter neighborhoods become large, exponentially many points are needed for reliable density estimation. [sent-10, score-0.095]

7 Another important problem is that as the data dimension grows, sophisticated data structures constructed to speed up nearest neighbor searches rapidly become inefficient. [sent-11, score-0.212]

8 Rather, the data distributions are observed to concentrate around nonlinear manifolds of low intrinsic dimension. [sent-13, score-0.171]

9 Although most of these algorithms require that the intrinsic dimension of the manifold be explicitly set, there has been little effort devoted to designing and analyzing techniques that estimate the intrinsic dimension of data in this context. [sent-15, score-0.751]

10 There are two principal areas where a good estimate of the intrinsic dimension can be useful. [sent-16, score-0.37]

11 First, as mentioned before, the estimate can be used to set input parameters of dimension reduction algorithms. [sent-17, score-0.211]

12 Some of these algorithms (e.g., LLE and the ISOMAP algorithm) also require a scale parameter that determines the size of the local neighborhoods used in the algorithms. [sent-20, score-0.043]

13 In this case, it is useful if the dimension estimate is provided as a function of the scale (see Figure 1 for an intuitive example where the intrinsic dimension of the data depends on the resolution). [sent-21, score-0.558]

14 Nearest neighbor searching algorithms can also profit from a good dimension estimate. [sent-22, score-0.173]

15 The complexities of such data structures (e.g., kd-trees and R-trees) increase exponentially with the dimension, and these methods become inefficient if the dimension is more than about 20. [sent-25, score-0.186]

16 It was shown in [5] that the complexity increases with the intrinsic dimension of the data rather than with the dimension of the embedding space. [sent-27, score-0.519]

17 Figure 1: Intrinsic dimension D at different resolutions. [sent-28, score-0.162]

18 (a) At very small scale the data looks zero-dimensional. [sent-29, score-0.041]

19 (b) If the scale is comparable to the noise level, the intrinsic dimension seems larger than expected. [sent-30, score-0.557]

20 (d) At very large scale the global dimension dominates. [sent-32, score-0.189]

21 In this paper we present a novel method for intrinsic dimension estimation. [sent-33, score-0.319]

22 The estimate is based on geometric properties of the data, and requires no parameters to set. [sent-34, score-0.077]

23 Experimental results on both artificial and real data show that the algorithm is able to capture the scale dependence of the intrinsic dimension. [sent-35, score-0.209]

24 The main advantage of the method over existing techniques is its robustness in terms of the generating distribution. [sent-36, score-0.05]

25 In Section 2 we introduce the field of intrinsic dimension estimation, and give a short overview of existing approaches. [sent-38, score-0.319]

26 Informally, the intrinsic dimension of a random vector X is usually defined as the number of “independent” parameters needed to represent X. [sent-41, score-0.495]

27 Although in practice this informal notion seems to have a well-defined meaning, formally it is ambiguous due to the existence of space-filling curves. [sent-42, score-0.055]

28 So, instead of this informal notion, we turn to the classical concept of topological dimension, and define the intrinsic dimension of X as the topological dimension of the support of the distribution of X . [sent-43, score-0.597]

29 Given a topological space X, a covering of a subset S is a collection C of open subsets of X whose union contains S. [sent-45, score-0.118]

30 A refinement of a covering C of S is another covering C′ such that each set in C′ is contained in some set in C. [sent-46, score-0.121]

31 Definition 1 A subset S of a topological space X has topological dimension Dtop (also known as Lebesgue covering dimension) if every covering C of S has a refinement C′ in which every point of S belongs to at most (Dtop + 1) sets in C′, and Dtop is the smallest such integer. [sent-48, score-0.363]

32 The main technical difficulty with the topological dimension is that it is computationally difficult to estimate on a finite sample. [sent-49, score-0.252]

33 Hence, practical methods use various other definitions of the intrinsic dimension. [sent-50, score-0.157]

34 It is common to categorize intrinsic dimension estimating methods into two classes, projection techniques and geometric approaches. [sent-51, score-0.4]

35 Projection techniques explicitly construct a mapping, and usually measure the dimension by using some variants of principal component analysis. [sent-52, score-0.19]

36 Given a sample Sn = {x1, . . . , xn} of data points drawn independently from the distribution of X, probably the most obvious way to estimate the intrinsic dimension is by looking at the eigenstructure of the covariance matrix C of Sn. [sent-59, score-0.398]

37 In addition, if the manifold is highly nonlinear, Dpca will characterize the global (intrinsic) dimension of the data rather than the local dimension of the manifold. [sent-62, score-0.388]
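
As an illustration of the projection approach described here, below is a minimal Python sketch of a PCA-based dimension estimate; the function name and the 95% variance threshold are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def pca_dimension(X, var_threshold=0.95):
    """Estimate D_pca as the number of leading eigenvalues of the sample
    covariance needed to explain var_threshold of the total variance.
    X is an (n, d) array: n points in a d-dimensional embedding space."""
    C = np.cov(X, rowvar=False)            # d x d covariance matrix of Sn
    eigvals = np.linalg.eigvalsh(C)[::-1]  # eigenvalues in descending order
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, var_threshold) + 1)
```

As the text notes, such a global estimate reflects the eigenstructure of the embedding, so on a highly nonlinear manifold it characterizes the global rather than the local dimension.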

38 The advantages of the method are that it works well on non-linear data, and that it can produce dimension estimates at different resolutions. [sent-70, score-0.162]

39 The technique is similar in spirit to the way the dimension parameter of LLE is set in [3]. [sent-72, score-0.174]

40 The algorithm runs in O(n²d) time (where n is the number of points and d is the embedding dimension), which is slightly worse than the O(nd·Dpca) complexity of the fast PCA algorithm of Roweis [7] when computing Dpca. [sent-73, score-0.063]

41 Another general scheme in the family of projection techniques is to turn the dimensionality reduction algorithm from an embedding technique into a probabilistic, generative model [8], and optimize the dimension as any other parameter by using cross-validation in a maximum likelihood setting. [sent-74, score-0.279]

42 The main disadvantage of this approach is that the dimension estimate depends on the generative model and the particular algorithm, so if the model does not fit the data or if the algorithm does not work well on the particular problem, the estimate can be invalid. [sent-75, score-0.286]

43 The second basic approach to intrinsic dimension estimation is based on geometric properties of the data rather than on projection techniques. [sent-76, score-0.403]

44 Most of the geometric methods use the correlation dimension from the family of fractal dimensions due to the computational simplicity of its estimation. [sent-78, score-0.262]

45 Given a finite sample Sn = {x1, . . . , xn} of a metric space X, let Cn(r) = (2 / (n(n−1))) ∑_{i=1}^{n} ∑_{j=i+1}^{n} I{‖xi − xj‖ < r}; the covering number N(r) of a set S is the smallest number of open balls of radius r whose union is a covering of S. [sent-83, score-0.123]
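
A minimal sketch of the correlation sum and of a two-scale correlation dimension estimate (the two-scale form mirrors Definition 5 below); Euclidean distance and the function names are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_sum(X, r):
    """C_n(r): the fraction of point pairs at distance less than r."""
    d = pdist(X)          # all n(n-1)/2 pairwise Euclidean distances
    return np.mean(d < r)

def d_corr(X, r1, r2):
    """Scale-dependent correlation dimension between radii r1 < r2."""
    c1, c2 = correlation_sum(X, r1), correlation_sum(X, r2)
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))
```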

46 The following definition is based on the observation that the covering number N(r) of a D-dimensional set is proportional to r^(−D). [sent-84, score-0.055]

47 Definition 4 The capacity dimension of a subset S of a metric space X is Dcap = −lim_{r→0} (log N(r) / log r). [sent-85, score-0.249]

48 The principal advantage of Dcap over Dcorr is that Dcap does not depend on the data distribution on the manifold. [sent-86, score-0.055]

49 The main contribution of this paper is an efficient intrinsic dimension estimating method that is based on the capacity dimension. [sent-89, score-0.365]

50 Experiments on both synthetic and real data confirm that our method is much more robust in terms of the data distribution than methods based on the correlation dimension. [sent-90, score-0.044]

51 Finding the covering number even of a finite set of data points is computationally difficult. [sent-91, score-0.086]

52 To tackle this problem, first we redefine Dcap by using packing numbers rather than covering numbers. [sent-92, score-0.135]

53 Given a metric space X with distance metric d(·, ·), a set V ⊂ X is said to be r-separated if d(x, y) ≥ r for all distinct x, y ∈ V; the r-packing number M(r) of a set S is the maximum cardinality of an r-separated subset of S. [sent-93, score-0.052]

54 The following proposition follows from the basic inequality between packing and covering numbers, N(r) ≤ M(r) ≤ N(r/2): the capacity dimension can equivalently be written as Dcap = −lim_{r→0} (log M(r) / log r). [sent-95, score-0.147]

55 For a finite sample the zero limit cannot be achieved, so, similarly to the correlation dimension, we need to redefine the capacity dimension in a scale-dependent manner. [sent-97, score-0.224]

56 Definition 5 The scale-dependent capacity dimension of a finite set Sn = {x1, . . . , xn} is Dcap(r1, r2) = −(log M(r2) − log M(r1)) / (log r2 − log r1). [sent-98, score-0.182]

58 Finding M(r) for a data set Sn = {x1, . . . , xn} is equivalent to finding the cardinality of a maximum independent vertex set MI(Gr) of the graph Gr(V, E) with vertex set V = Sn and edge set E = {(xi, xj) | d(xi, xj) < r}. [sent-102, score-0.066]
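
For intuition, the graph formulation can be made concrete with networkx (an assumption for illustration; the paper does not use it): a maximal, though not necessarily maximum, independent set of Gr is exactly an r-packing, which is also what the greedy procedure sketched further below produces.

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform

def r_packing_via_graph(X, r, seed=0):
    """Build G_r and return a maximal independent set, i.e. an r-packing.
    Note: maximal, not maximum, so this matches the greedy estimate of M(r)."""
    D = squareform(pdist(X))                   # pairwise distance matrix
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    G.add_edges_from(zip(*np.nonzero(D < r)))  # edge iff d(x_i, x_j) < r
    G.remove_edges_from(nx.selfloop_edges(G))  # drop the zero diagonal
    return nx.maximal_independent_set(G, seed=seed)
```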

60 On the positive side, it was shown that for such geometric graphs as Gr , MI(G) can be approximated arbitrarily well by polynomial time algorithms [13]. [sent-108, score-0.041]

61 However, approximation algorithms of this kind scale exponentially with the data dimension, both in terms of the quality of the approximation and the running time, so they are of little practical use for d > 2. [sent-109, score-0.216]

62 Given a data set Sn, we start with an empty set of centers C, and in an iteration over Sn we add to C data points that are at a distance of at least r from all the centers in C (lines 4 to 10 in Figure 2). [sent-111, score-0.077]

63 The estimate M(r) is the cardinality of C after every point in Sn has been visited. [sent-112, score-0.064]
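
The greedy center-selection loop described above (lines 4 to 10 in the paper's Figure 2) can be sketched in Python as follows; this is a reconstruction from the prose, not the paper's own pseudocode.

```python
import numpy as np

def greedy_packing_number(X, r, rng=None):
    """Greedy estimate of the r-packing number M(r): visit the points in
    order and keep those at distance >= r from all previously kept centers."""
    if rng is not None:                    # optional random visiting order
        X = X[rng.permutation(len(X))]
    centers = []
    for x in X:
        if all(np.linalg.norm(x - c) >= r for c in centers):
            centers.append(x)
    return len(centers)                    # the estimate of M(r)
```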

64 The procedure is designed to produce an r-packing but certainly underestimates the packing number of the manifold, first, because we are using a finite sample, and second, because in general the greedy estimate is smaller than the true packing number M(r). [sent-113, score-0.112]

65 To see why this underestimation is acceptable, observe that, for a good estimate of Dcap, it is enough if we can estimate M(r) with a constant multiplicative bias independent of r. [sent-115, score-0.094]

66 Although we have no formal proof that the bias of the estimate does not change with r, the simple greedy procedure described above seems to work well in practice. [sent-116, score-0.041]

67 Even though the bias of the estimate does not affect the estimation of Dcap as long as it does not change with r, the variance of the estimate can distort the dimension estimate. [sent-117, score-0.198]

68 The main source of the variance is the dependence of M(r) on the order in which the data points are visited. [sent-118, score-0.045]

69 To eliminate this variance, we repeat the procedure several times on random permutations of the data, and compute the estimate Dpack by using the average of the logarithms of the packing numbers. [sent-119, score-0.116]
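
Putting the pieces together, a sketch of the full Dpack estimate: packing numbers are averaged in log space over several random permutations, as just described, and plugged into the two-scale formula of Definition 5. The number of permutations is an illustrative choice, and the function reuses greedy_packing_number from the sketch above.

```python
import numpy as np

def d_pack(X, r1, r2, n_perm=10, seed=0):
    """Scale-dependent packing-dimension estimate between radii r1 < r2."""
    rng = np.random.default_rng(seed)
    log_m = {r1: [], r2: []}
    for _ in range(n_perm):   # average log M(r) over random visiting orders
        for r in (r1, r2):
            log_m[r].append(np.log(greedy_packing_number(X, r, rng)))
    m1, m2 = np.mean(log_m[r1]), np.mean(log_m[r2])
    return -(m2 - m1) / (np.log(r2) - np.log(r1))
```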

70 On the other hand, since the variance of the estimate also tends to be smaller at smaller scales, the algorithm iterates less for the same accuracy. [sent-124, score-0.061]

71 The two main objectives of the four experiments described here are to demonstrate the ability of the method to capture the scale-dependent behavior of the intrinsic dimension, and to underline its robustness in terms of the data generating distribution. [sent-125, score-0.221]

72 In all experiments, the estimate Dpack is compared to the correlation dimension estimate Dcorr . [sent-126, score-0.25]

73 The estimates are computed for consecutive pairs of a sequence r1, . . . , rm of resolutions, and each estimate is plotted halfway between the two parameters (i.e., at (ri + ri+1)/2). [sent-130, score-0.053]

74 ) In the first three experiments the manifold is either known or can be approximated easily. [sent-133, score-0.063]

75 We generated 5000 points on a spiral-shaped manifold with a small uniform perpendicular noise. [sent-139, score-0.067]

76 Figure 3: Intrinsic dimension of (a) a spiral-shaped manifold and (b) hypercubes of different dimensions. [sent-157, score-0.23]

77 The more uneven the distribution, the more Dcorr underestimates Dtop while Dpack remains relatively stable. [sent-159, score-0.048]

78 The second set of experiments was designed to test how well the methods estimate the dimension of 5000 data points generated in hypercubes of dimensions two to six (Figure 3(b)). [sent-160, score-0.274]

79 The negative bias grows with the dimension, probably due to the fact that data sets of equal cardinality become sparser in higher-dimensional spaces. [sent-162, score-0.098]
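
A hypothetical usage example in the spirit of the hypercube experiment, using the d_pack sketch above (the sample size matches the text; the radii are illustrative guesses):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 4))    # 5000 points in a 4-dimensional hypercube
print(d_pack(X, r1=0.2, r2=0.4))   # expect a value near 4, biased slightly low
```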

80 This experiment shows that the calibrating procedure of Dcorr can fail if the distribution is highly non-uniform. [sent-164, score-0.528]

81 The first set is a sequence of 481 snapshots of a hand turning a cup from the CMU database (Figure 4(a)). [sent-169, score-0.065]

82 The sequence of images sweeps a curve in a 4096-dimensional space, so its informal intrinsic dimension is one. [sent-170, score-0.193]

83 Figure 5(a) shows that at a small scale, both methods find a local dimension between 1 and 2. [sent-171, score-0.162]

84 Our results in Figure 5(a) confirm again that Dcorr varies extensively while Dpack is stable. [sent-175, score-0.528]

85 The data set contained 698 images of faces generated by using three free parameters: vertical and horizontal orientation, and light direction. [sent-177, score-0.05]

86 Figure 5(b) indicates that both estimates are reasonably close to the informal intrinsic dimension. [sent-178, score-0.193]

87 (Figure 5 legend: (a) turning cup, with Dcorr and Dpack curves; (b) ISOMAP faces, with Dcorr and Dpack curves at several values of p.) [sent-179, score-0.045]

88 (a) Sequence of snapshots of a hand turning a cup. [sent-188, score-0.056]

89 Figure 5: The intrinsic dimension of image data sets. [sent-208, score-0.333]

90 We found in all experiments that at a very small scale Dcorr tends to be higher than Dpack (footnote 2: http://vasc. ... html), [sent-209, score-0.054]

91 while Dpack tends to be more stable as the scale grows. [sent-213, score-0.041]

92 Hence, if the data contains very little noise and it is generated uniformly on the manifold, Dcorr seems to be closer to the “real” intrinsic dimension. [sent-214, score-0.202]

93 On the other hand, if the data contains noise (in which case at a very small scale we are estimating the dimension of the noise rather than the dimension of the manifold), or the distribution on the manifold is non-uniform, Dpack seems more reliable than Dcorr . [sent-215, score-0.473]

94 We have presented a new algorithm to estimate the intrinsic dimension of data sets. [sent-216, score-0.38]

95 The method estimates the packing dimension of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. [sent-217, score-0.307]

96 Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise. [sent-219, score-0.064]

97 J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, 2000. [sent-238, score-0.058]

98 J. Bruske and G. Sommer, “Intrinsic dimensionality estimation with optimally topology preserving maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-249, score-0.067]

99 F. Camastra and A. Vinciarelli, “Estimating intrinsic dimension of data with a fractal-based approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, to appear. [sent-275, score-0.333]

100 T. Erlebach, K. Jansen, and E. Seidel, “Polynomial-time approximation schemes for geometric graphs,” in Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms (SODA’01), 2001, pp. [sent-288, score-0.041]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pack', 0.545), ('corr', 0.528), ('dcorr', 0.332), ('dpack', 0.298), ('replacements', 0.192), ('dimension', 0.162), ('intrinsic', 0.157), ('psfrag', 0.145), ('dcap', 0.126), ('dpsfrag', 0.092), ('sn', 0.091), ('pp', 0.088), ('dtop', 0.08), ('packing', 0.08), ('pd', 0.076), ('dpca', 0.069), ('covering', 0.055), ('manifold', 0.05), ('geometric', 0.041), ('topological', 0.04), ('isomap', 0.038), ('informal', 0.036), ('estimate', 0.036), ('optm', 0.034), ('underestimates', 0.032), ('cardinality', 0.028), ('pca', 0.027), ('reliable', 0.027), ('scale', 0.027), ('log', 0.026), ('nition', 0.026), ('metric', 0.026), ('gr', 0.025), ('turning', 0.025), ('faces', 0.025), ('embedding', 0.024), ('mi', 0.024), ('bruske', 0.023), ('sommer', 0.023), ('generating', 0.023), ('ri', 0.022), ('lle', 0.022), ('bias', 0.022), ('capacity', 0.02), ('cup', 0.02), ('snapshots', 0.02), ('consecutive', 0.019), ('vertex', 0.019), ('seems', 0.019), ('hypercubes', 0.018), ('rede', 0.018), ('xn', 0.017), ('rm', 0.017), ('points', 0.017), ('dimensionality', 0.017), ('fractal', 0.017), ('roweis', 0.016), ('correlation', 0.016), ('neighborhoods', 0.016), ('uneven', 0.016), ('centers', 0.016), ('principal', 0.015), ('projection', 0.015), ('lim', 0.015), ('lattice', 0.015), ('cox', 0.015), ('data', 0.014), ('neither', 0.014), ('tends', 0.014), ('nement', 0.014), ('inef', 0.014), ('ch', 0.014), ('main', 0.014), ('estimation', 0.014), ('dimensions', 0.014), ('preserving', 0.013), ('reduction', 0.013), ('experiments', 0.013), ('exponentially', 0.013), ('xi', 0.013), ('disadvantage', 0.013), ('techniques', 0.013), ('nite', 0.013), ('family', 0.012), ('union', 0.012), ('uniformly', 0.012), ('technique', 0.012), ('optimally', 0.012), ('estimating', 0.012), ('probably', 0.012), ('proposition', 0.012), ('open', 0.011), ('sets', 0.011), ('hand', 0.011), ('topology', 0.011), ('contained', 0.011), ('spaces', 0.011), ('algorithm', 0.011), ('become', 0.011), ('neighbor', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

Author: Balázs Kégl

Abstract: We propose a new algorithm to estimate the intrinsic dimension of data sets. The method is based on geometric properties of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. The method is compared to a similar, widely-used algorithm from the same family of geometric techniques. Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise.

2 0.065791443 49 nips-2002-Charting a Manifold

Author: Matthew Brand

Abstract: We construct a nonlinear mapping from a high-dimensional sample space to a low-dimensional vector space, effectively recovering a Cartesian coordinate system for the manifold from which the data is sampled. The mapping preserves local geometric relations in the manifold and is pseudo-invertible. We show how to estimate the intrinsic dimensionality of the manifold from samples, decompose the sample data into locally linear low-dimensional patches, merge these patches into a single low-dimensional coordinate system, and compute forward and reverse mappings between the sample and coordinate spaces. The objective functions are convex and their solutions are given in closed form.

3 0.065123111 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

4 0.048342422 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.

5 0.044692643 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a high-dimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4].

6 0.042775609 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement

7 0.041310783 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

8 0.03137121 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits

9 0.031246694 138 nips-2002-Manifold Parzen Windows

10 0.029195921 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

11 0.028836031 98 nips-2002-Going Metric: Denoising Pairwise Data

12 0.028491324 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

13 0.028348153 194 nips-2002-The Decision List Machine

14 0.027604215 113 nips-2002-Information Diffusion Kernels

15 0.02733035 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

16 0.026876252 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

17 0.026246971 115 nips-2002-Informed Projections

18 0.025522128 124 nips-2002-Learning Graphical Models with Mercer Kernels

19 0.024676232 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

20 0.022549098 190 nips-2002-Stochastic Neighbor Embedding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.075), (1, -0.02), (2, -0.011), (3, 0.014), (4, -0.023), (5, 0.019), (6, -0.025), (7, -0.001), (8, -0.032), (9, 0.105), (10, 0.028), (11, 0.011), (12, 0.005), (13, -0.056), (14, 0.036), (15, 0.092), (16, -0.039), (17, 0.072), (18, -0.009), (19, -0.047), (20, -0.037), (21, 0.03), (22, 0.001), (23, -0.046), (24, 0.013), (25, 0.005), (26, -0.004), (27, 0.004), (28, 0.018), (29, -0.045), (30, -0.032), (31, 0.037), (32, 0.045), (33, -0.032), (34, 0.034), (35, 0.049), (36, 0.035), (37, -0.043), (38, -0.073), (39, 0.074), (40, 0.087), (41, 0.03), (42, -0.005), (43, -0.032), (44, 0.062), (45, -0.128), (46, -0.144), (47, 0.081), (48, -0.066), (49, -0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92538756 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

Author: Balázs Kégl

Abstract: We propose a new algorithm to estimate the intrinsic dimension of data sets. The method is based on geometric properties of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. The method is compared to a similar, widely-used algorithm from the same family of geometric techniques. Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise.

2 0.60271358 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

3 0.57782763 49 nips-2002-Charting a Manifold

Author: Matthew Brand

Abstract: We construct a nonlinear mapping from a high-dimensional sample space to a low-dimensional vector space, effectively recovering a Cartesian coordinate system for the manifold from which the data is sampled. The mapping preserves local geometric relations in the manifold and is pseudo-invertible. We show how to estimate the intrinsic dimensionality of the manifold from samples, decompose the sample data into locally linear low-dimensional patches, merge these patches into a single lowdimensional coordinate system, and compute forward and reverse mappings between the sample and coordinate spaces. The objective functions are convex and their solutions are given in closed form. 1 Nonlinear dimensionality reduction (NLDR) by charting Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space. It is presumed that the data lies on or near a lowdimensional manifold embedded in the sample space, and that there exists a 1-to-1 smooth nonlinear transform between the manifold and a low-dimensional vector space. The datamodeler’s goal is to estimate smooth continuous mappings between the sample and coordinate spaces. Often this analysis will shed light on the intrinsic variables of the datagenerating phenomenon, for example, revealing perceptual or configuration spaces. Our goal is to find a mapping—expressed as a kernel-based mixture of linear projections— that minimizes information loss about the density and relative locations of sample points. This constraint is expressed in a posterior that combines a standard gaussian mixture model (GMM) likelihood function with a prior that penalizes uncertainty due to inconsistent projections in the mixture. Section 3 develops a special case where this posterior is unimodal and maximizable in closed form, yielding a GMM whose covariances reveal a patchwork of overlapping locally linear subspaces that cover the manifold. Section 4 shows that for this (or any) GMM and a choice of reduced dimension d, there is a unique, closed-form solution for a minimally distorting merger of the subspaces into a d-dimensional coordinate space, as well as an reverse mapping defining the surface of the manifold in the sample space. The intrinsic dimensionality d of the data manifold can be estimated from the growth process of point-to-point distances. In analogy to differential geometry, we call the subspaces “charts” and their merger the “connection.” Section 5 considers example problems where these methods are used to untie knots, unroll and untwist sheets, and visualize video data. 1.1 Background Topology-neutral NLDR algorithms can be divided into those that compute mappings, and those that directly compute low-dimensional embeddings. The fi has its roots in mapeld ping algorithms: DeMers and Cottrell [3] proposed using auto-encoding neural networks with a hidden layer “ bottleneck,” effectively casting dimensionality reduction as a compression problem. Hastie defi principal curves [5] as nonparametric 1 D curves that pass ned through the center of “ nearby” data points. A rich literature has grown up around properly regularizing this approach and extending it to surfaces. Smola and colleagues [10] analyzed the NLDR problem in the broader framework of regularized quantization methods. More recent advances aim for embeddings: Gomes and Mojsilovic [4] treat manifold completion as an anisotropic diffusion problem, iteratively expanding points until they connect to their neighbors. 
The ISOMAP algorithm [12] represents remote distances as sums of a trusted set of distances between immediate neighbors, then uses multidimensional scaling to compute a low-dimensional embedding that minimally distorts all distances. The locally linear embedding algorithm (LLE) [9] represents each point as a weighted combination of a trusted set of nearest neighbors, then computes a minimally distorting low-dimensional barycentric embedding. They have complementary strengths: ISOMAP handles holes well but can fail if the data hull is nonconvex [12]; and vice versa for LLE [9]. Both offer embeddings without mappings. It has been noted that trusted-set methods are vulnerable to noise because they consider the subset of point-to-point relationships that has the lowest signal-to-noise ratio; small changes to the trusted set can induce large changes in the set of constraints on the embedding, making solutions unstable [1]. In a return to mapping, Roweis and colleagues [8] proposed global coordination: learning a mixture of locally linear projections from sample to coordinate space. They constructed a posterior that penalizes distortions in the mapping, and gave an expectation-maximization (EM) training rule. Innovative use of variational methods highlighted the difficulty of even hill-climbing their multimodal posterior. Like [2, 7, 6, 8], the method we develop below is a decomposition of the manifold into locally linear neighborhoods. It bears closest relation to global coordination [8], although by a different construction of the problem we avoid hill-climbing a spiky posterior and instead develop a closed-form solution.

2 Estimating locally linear scale and intrinsic dimensionality

We begin with a matrix of sample points Y = [y_1, ..., y_N], y_n ∈ R^D, populating a D-dimensional sample space, and a conjecture that these points are samples from a manifold M of intrinsic dimensionality d < D. We seek a mapping onto a vector space G(Y) → X = [x_1, ..., x_N], x_n ∈ R^d, and a 1-to-1 reverse mapping G⁻¹(X) → Y, such that local relations between nearby points are preserved (this will be formalized below). The map G should be non-catastrophic, that is, without folds: parallel lines on the manifold in R^D should map to continuous smooth non-intersecting curves in R^d. This guarantees that linear operations on X such as interpolation will have reasonable analogues on Y. Smoothness means that at some scale r the mapping from a neighborhood on M to R^d is effectively linear. Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as r^d, but only at the locally linear scale; the growth rate is inflated by isotropic noise at smaller scales and by embedding curvature at larger scales. To estimate r, we look at how the r-ball grows as points are added to it, tracking c(r) = d log r / d log n(r). At noise scales, c(r) ≈ 1/D < 1/d, because noise has distributed points in all directions with equal probability. At the scale at which curvature becomes significant, c(r) < 1/d, because the manifold is no longer perpendicular to the surface of the ball, so the ball does not have to grow as fast to accommodate new points. At the locally linear scale, the process peaks at c(r) = 1/d, because points are distributed only in the directions of the manifold's local tangent space.
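The growth statistic above is easy to prototype. Below is a minimal sketch (our own code and names, not the authors' implementation), assuming the data sit in a numpy array: it counts neighbors within a grid of radii around a few anchor points, averages the growth curves, and reads d and r off the peak of a finite-difference estimate of c(r) = d log r / d log n(r).

```python
import numpy as np
from scipy.spatial import cKDTree

def growth_dimension(Y, radii, n_anchors=50, seed=0):
    """Estimate (intrinsic dimension d, locally linear scale r) from the
    point-count growth process; c(r) = d log r / d log n(r) peaks at 1/d.
    radii must be increasing and chosen so the counts keep growing."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(Y)
    idx = rng.choice(len(Y), size=min(n_anchors, len(Y)), replace=False)
    # counts[i, j] = number of points within radii[j] of anchor i
    counts = np.array([[len(tree.query_ball_point(Y[i], r)) for r in radii]
                       for i in idx], dtype=float)
    log_n = np.log(counts).mean(axis=0)   # averaged over nearby r-balls
    log_r = np.log(radii)
    c = np.diff(log_r) / np.maximum(np.diff(log_n), 1e-12)
    peak = int(np.argmax(c))              # the text uses the first peak
    return 1.0 / c[peak], float(radii[peak])
```

A simple argmax over log-spaced radii stands in here for the paper's averaged first-peak search.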
The maximum of c(r) therefore gives an estimate of both the scale and the local dimensionality of the manifold (see figure 1), provided that the ball hasn't expanded to a manifold boundary (boundaries have lower dimension than the manifold).

[Figure 1: Point growth processes. Panels: “Scale behavior of a 1D manifold in 2-space” and “Point-count growth process on a 2D manifold in 3-space” (axes: radius vs. #points, log scales; legend: radial growth process, 1D/2D/3D hypotheses, noise/locally linear/curvature scales). Left: at the locally linear scale, the number of points in an r-ball grows as r^d; at noise and curvature scales it grows faster. Right: using the point-count growth process to find the intrinsic dimensionality of a 2D manifold nonlinearly embedded in 3-space (see figure 2). Lines of slope 1/3, 1/2, and 1 are fitted to sections of the log r / log n(r) curve. For neighborhoods of radius r ≈ 1 with roughly n ≈ 10 points, the slope peaks at 1/2, indicating a dimensionality of d = 2. Below that, the data appears 3D because it is dominated by noise (except for n ≤ D points); above, the data appears >2D because of manifold curvature. As the r-ball expands to cover the entire data set, the dimensionality appears to drop to 1 as the process begins to track the 1D edges of the 2D sheet.]

For low-dimensional manifolds such as sheets, the boundary submanifolds (edges and corners) are very small relative to the full manifold, so the boundary effect is typically limited to a small rise in c(r) as r approaches the scale of the entire data set. In practice, our code simply expands an r-ball at every point and looks for the first peak in c(r), averaged over many nearby r-balls. One can estimate d and r globally or per-point.

3 Charting the data

In the charting step we find a soft partitioning of the data into locally linear low-dimensional neighborhoods, as a prelude to computing the connection that gives the global low-dimensional embedding. To minimize information loss in the connection, we require that the data points project into a subspace associated with each neighborhood with (1) minimal loss of local variance and (2) maximal agreement of the projections of nearby points into nearby neighborhoods. Criterion (1) is served by maximizing the likelihood function of a Gaussian mixture model (GMM) density fitted to the data:

p(y_i | µ, Σ) = ∑_j p(y_i | µ_j, Σ_j) p_j = ∑_j N(y_i; µ_j, Σ_j) p_j.   (1)

Each gaussian component defines a local neighborhood centered around µ_j with axes defined by the eigenvectors of Σ_j. The amount of data variance along each axis is indicated by the eigenvalues of Σ_j; if the data manifold is locally linear in the vicinity of µ_j, all but the d dominant eigenvalues will be near-zero, implying that the associated eigenvectors constitute the optimal variance-preserving local coordinate system. To some degree likelihood maximization will naturally realize this property: it requires that the GMM components shrink in volume to fit the data as tightly as possible, which is best achieved by positioning the components so that they “pancake” onto locally flat collections of datapoints. However, this state of affairs is easily violated by degenerate (zero-variance) GMM components or components fitted to overly small locales where the data density off the manifold is comparable to the density on the manifold (e.g., at the noise scale). Consequently a prior is needed.
Criterion (2) implies that neighboring partitions should have dominant axes that span similar subspaces, since disagreement (large subspace angles) would lead to inconsistent projections of a point and therefore uncertainty about its location in a low-dimensional coordinate space. The principal insight is that criterion (2) is exactly the cost of coding the location of a point in one neighborhood when it is generated by another neighborhood: the cross-entropy between the gaussian models defining the two neighborhoods,

D(N1 ‖ N2) = ∫ dy N(y; µ1, Σ1) log [ N(y; µ1, Σ1) / N(y; µ2, Σ2) ]
           = ( log|Σ1⁻¹Σ2| + trace(Σ2⁻¹Σ1) + (µ2 − µ1)ᵀ Σ2⁻¹ (µ2 − µ1) − D ) / 2.   (2)

Roughly speaking, the terms in (2) measure differences in size, orientation, and position, respectively, of two coordinate frames located at the means µ1, µ2 with axes specified by the eigenvectors of Σ1, Σ2. All three terms decline to zero as the overlap between the two frames is maximized. To maximize consistency between adjacent neighborhoods, we form the prior p(µ, Σ) = exp[ −∑_{i≠j} m_i(µ_j) D(N_i ‖ N_j) ], where m_i(µ_j) is a measure of co-locality. Unlike global coordination [8], we are not asking that the dominant axes in neighboring charts be aligned, only that they span nearly the same subspace. This is a much easier objective to satisfy, and it contains a useful special case where the posterior p(µ, Σ | Y) ∝ ∏_i p(y_i | µ, Σ) · p(µ, Σ) is unimodal and can be maximized in closed form: Let us associate a gaussian neighborhood with each data point, setting µ_i = y_i; take all neighborhoods to be a priori equally probable, setting p_i = 1/N; and let the co-locality measure be determined by some local kernel. For example, in this paper we use m_i(µ_j) ∝ N(µ_j; µ_i, σ²), with the scale parameter σ specifying the expected size of a neighborhood on the manifold in sample space. A reasonable choice is σ = r/2, so that erf(2) > 99.5% of the density of m_i(µ_j) is contained in the area around y_i where the manifold is expected to be locally linear. With p_i uniform, and µ_i and m_i(µ_j) fixed, the MAP estimates of the GMM covariances are

Σ_i = [ ∑_j m_i(µ_j) ( (y_j − µ_i)(y_j − µ_i)ᵀ + (µ_j − µ_i)(µ_j − µ_i)ᵀ + Σ_j ) ] / ∑_j m_i(µ_j).   (3)

Note that each covariance Σ_i is dependent on all other Σ_j. The MAP estimators for all covariances can be arranged into a set of fully constrained linear equations and solved exactly for their mutually optimal values. This key step brings nonlocal information about the manifold's shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces. Even if a local subset of data points is dense in a direction perpendicular to the manifold, the prior encourages the local chart to orient parallel to the manifold as part of a globally optimal solution, protecting against a pathology noted in [8]. Equation (3) is easily adapted to give a reduced number of charts and/or charts centered on local centroids.
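For concreteness, the penalty in equation (2) is the standard KL divergence between two gaussians, which is straightforward to compute directly; a minimal sketch (function name ours):

```python
import numpy as np

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """Equation (2): D(N1 || N2) for gaussians N(mu1, Sigma1), N(mu2, Sigma2)."""
    D = mu1.shape[0]
    dmu = mu2 - mu1
    Sigma2_inv = np.linalg.inv(Sigma2)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    # (log|Sigma1^-1 Sigma2| + tr(Sigma2^-1 Sigma1)
    #  + (mu2-mu1)^T Sigma2^-1 (mu2-mu1) - D) / 2
    return 0.5 * (logdet2 - logdet1
                  + np.trace(Sigma2_inv @ Sigma1)
                  + dmu @ Sigma2_inv @ dmu
                  - D)
```

The exponent of the co-locality prior is then the sum of m_i(µ_j) · gaussian_kl(µ_i, Σ_i, µ_j, Σ_j) over all pairs i ≠ j.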
4 Connecting the charts

We now build a connection for a set of charts specified as an arbitrary nondegenerate GMM. A GMM gives a soft partitioning of the dataset into neighborhoods of mean µ_k and covariance Σ_k. The optimal variance-preserving low-dimensional coordinate system for each neighborhood derives from its weighted principal component analysis, which is exactly specified by the eigenvectors of its covariance matrix: eigendecompose V_k Λ_k V_kᵀ ← Σ_k with eigenvalues in descending order on the diagonal of Λ_k, and let W_k = [I_d, 0] V_kᵀ be the operator projecting points into the k-th local chart, such that the local chart coordinate u_{ki} = W_k (y_i − µ_k) and U_k = [u_{k1}, ..., u_{kN}] holds the local coordinates of all points. Our goal is to sew together all charts into a globally consistent low-dimensional coordinate system. For each chart there will be a low-dimensional affine transform G_k ∈ R^{(d+1)×d} that projects U_k into the global coordinate space. Summing over all charts, the weighted average of the projections of point y_i into the low-dimensional vector space is

x|y = ∑_j G_jᵀ [W_j(y − µ_j); 1] p_{j|y}(y)   ⇒   x_i|y_i = ∑_j G_jᵀ [u_{ji}; 1] p_{j|y}(y_i),   (4)

where p_{k|y}(y) ∝ p_k N(y; µ_k, Σ_k), with ∑_k p_{k|y}(y) = 1, is the probability that chart k generates point y. As pointed out in [8], if a point has nonzero probabilities in two charts, then there should be affine transforms of those two charts that map the point to the same place in a global coordinate space. We set this up as a weighted least-squares problem:

G = [G_1, ..., G_K] = arg min_{G_k, G_j} ∑_i p_{k|y}(y_i) p_{j|y}(y_i) ‖ G_kᵀ [u_{ki}; 1] − G_jᵀ [u_{ji}; 1] ‖²_F.   (5)

Equation (5) generates a homogeneous set of equations that determines a solution up to an affine transform of G. There are two solution methods. First, let us temporarily anchor one neighborhood at the origin to fix this indeterminacy. This adds the constraint G_1 = [I_d, 0]ᵀ. To solve, define the indicator matrix F_k = [0, ..., 0, I, 0, ..., 0]ᵀ with the identity matrix occupying the k-th block, such that G_k = G F_k. Let the diagonal of P_k = diag([p_{k|y}(y_1), ..., p_{k|y}(y_N)]) record the per-point posteriors of chart k. The squared error of the connection is then a sum of all patch-to-anchor and patch-to-patch inconsistencies:

E = ∑_k ‖ (G Ū_k − [U_1; 0]) P_k P_1 ‖²_F + ∑_{j≠k} ‖ (G Ū_j − G Ū_k) P_j P_k ‖²_F,   where Ū_k = F_k [U_k; 1].   (6)

Setting dE/dG = 0 and solving to minimize the convex E gives

G = ( ∑_k Ū_k P_k² (∑_{j≠k} P_j²) Ū_kᵀ − ∑_{j≠k} Ū_k P_k² P_j² Ū_jᵀ )⁻¹ ∑_k Ū_k P_k² P_1² [U_1; 0]ᵀ.   (7)

We now remove the dependence on a reference neighborhood G_1 by rewriting equation (5),

G = arg min_G ∑_{j≠k} ‖ (G Ū_j − G Ū_k) P_j P_k ‖²_F = ‖ G Q ‖²_F = trace(G Q Qᵀ Gᵀ),   (8)

where Q collects the blocks (Ū_j − Ū_k) P_j P_k over all pairs j ≠ k. If we require that G Gᵀ = I to prevent degenerate solutions, then equation (8) is solved (up to rotation in coordinate space) by setting G to the eigenvectors associated with the smallest eigenvalues of Q Qᵀ. The eigenvectors can be computed efficiently without explicitly forming Q Qᵀ; other numerical efficiencies obtain by zeroing any vanishingly small probabilities in each P_k, yielding a sparse eigenproblem. A more interesting strategy is to numerically condition the problem by calculating the trailing eigenvectors of Q Qᵀ + 11ᵀ. It can be shown that this maximizes the posterior p(G|Q) ∝ p(Q|G) p(G) ∝ e^{−‖GQ‖²_F} e^{−‖G1‖²}, where the prior p(G) favors a mapping G whose unit-norm rows are also zero-mean. This maximizes variance in each row of G and thereby spreads the projected points broadly and evenly over coordinate space. The solutions for MAP charts (equation (3)) and connection (equation (8)) can be applied to any well-fitted mixture of gaussians/factors¹/PCAs density model; thus large eigenproblems can be avoided by connecting just a small number of charts that cover the data.

¹ We thank reviewers for calling our attention to Teh & Roweis ([11], in this volume), which shows how to connect a set of given local dimensionality reducers in a generalized eigenvalue problem that is related to equation (8).
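A sketch of the unanchored solve of equation (8), under our reconstruction of the notation (variable names ours; the chart coordinates U_k and responsibilities P_k are assumed precomputed): form QQᵀ from the pairwise chart disagreements, apply the 11ᵀ conditioning mentioned in the text, and take trailing eigenvectors.

```python
import numpy as np

def solve_connection(U_list, P_list, d):
    """Equation (8): global map G and embedding X for K charts.

    U_list[k]: (d, N) local coordinates of the N points in chart k.
    P_list[k]: (N,)  responsibilities p_{k|y}(y_i) for chart k."""
    K, N = len(U_list), U_list[0].shape[1]
    # Ubar_k = F_k [U_k; 1]: homogeneous chart coordinates placed in block k
    Ubar = np.zeros((K, K * (d + 1), N))
    for k, U in enumerate(U_list):
        Ubar[k, k * (d + 1):(k + 1) * (d + 1)] = np.vstack([U, np.ones(N)])
    # QQ^T = sum over j != k of (Ubar_j - Ubar_k) P_j^2 P_k^2 (Ubar_j - Ubar_k)^T
    QQt = np.zeros((K * (d + 1), K * (d + 1)))
    for j in range(K):
        for k in range(K):
            if j != k:
                Dif = (Ubar[j] - Ubar[k]) * (P_list[j] * P_list[k])
                QQt += Dif @ Dif.T
    QQt += np.ones_like(QQt)          # the 1 1^T conditioning from the text
    w, V = np.linalg.eigh(QQt)        # eigenvalues in ascending order
    G = V[:, :d].T                    # trailing eigenvectors; G G^T = I
    # forward projection, equation (4): responsibility-weighted chart average
    X = G @ sum(P_list[k] * Ubar[k] for k in range(K))
    return G, X
```

A dense eigensolve is used here for brevity; as the text notes, sparse methods that avoid forming QQᵀ explicitly are preferable at scale.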
[Figure 2: The twisted curl problem. Left: comparison of charting, ISOMAP, and LLE (n = 5, ..., 10); panels show the original data (XYZ view), a random subset of local charts, the charting projection onto coordinate space, the best ISOMAP and best (regularized) LLE results, and the linked data and embedding in XY/XZ views, with the reconstruction as a back-projected coordinate grid. 400 points are randomly sampled from the manifold with noise. Charting is the only method that recovers the original space without catastrophes (folding), albeit with some shear. Right: the manifold is regularly sampled (with noise) to illustrate the forward and backward projections. Samples are shown linked into lines to help visualize the manifold structure. Coordinate axes of a random selection of charts are shown as bold lines. Connecting subsets of charts such as this will also give good mappings. The upper right quadrant shows various LLE results. At bottom we show the charting solution and the reconstructed (back-projected) manifold, which smooths out the noise.]

Once the connection is solved, equation (4) gives the forward projection of any point y down into coordinate space. There are several numerically distinct candidates for the back-projection: posterior mean, mode, or exact inverse. In general, there may not be a unique posterior mode, and the exact inverse is not solvable in closed form (this is also true of [8]). Note that chart-wise projection defines a complementary density in coordinate space,

p_{x|k}(x) = N( x; G_kᵀ [0; 1], G_kᵀ [ [I_d, 0] Λ_k [I_d, 0]ᵀ, 0; 0, 0 ] G_k ).   (9)

Let p(y|x, k), used to map x into subspace k on the surface of the manifold, be a Dirac delta function whose mean is a linear function of x. Then the posterior mean back-projection is obtained by integrating out uncertainty over which chart generates x:

y|x = ∑_k p_{k|x}(x) ( µ_k + W_kᵀ (G_kᵀ [I_d; 0])⁺ ( x − G_kᵀ [0; 1] ) ),   (10)

where (·)⁺ denotes pseudo-inverse. In general, a back-projecting map should not reconstruct the original points. Instead, equation (10) generates a surface that passes through the weighted average of the µ_i of all the neighborhoods in which y_i has nonzero probability, much like a principal curve passes through the center of each local group of points.

5 Experiments

Synthetic examples: 400 2D points were randomly sampled from a 2D square and embedded in 3D via a curl and twist, then contaminated with gaussian noise. Even if noiselessly sampled, this manifold cannot be “unrolled” without distortion. In addition, the outer curl is sampled much less densely than the inner curl. With an order of magnitude fewer points, higher noise levels, no possibility of an isometric mapping, and uneven sampling, this is arguably a much more challenging problem than the “swiss roll” and “s-curve” problems featured in [12, 9, 8, 1]. Figure 2 (left) contrasts the (unique) output of charting and the best outputs obtained from ISOMAP and LLE (considering all neighborhood sizes between 2 and 20 points). ISOMAP and LLE show catastrophic folding; we had to change LLE's regularization in order to coax out nondegenerate (>1D) solutions.

[Figure 3: Untying a trefoil knot by charting. 900 noisy samples from a 3D-embedded 1D manifold are shown as connected dots in front (a, data, xy view) and side (b, data, yz view) views. A subset of charts is shown in (c). Solving for the 2D connection gives the “unknot” in (d).
After removing some points to cut the knot, charting gives a 1D embedding which we plot against true manifold arc length in (e); monotonicity (modulo noise) indicates correctness.]

[Figure 4: Modeling the manifold of facial images from raw video; three principal degrees of freedom (pose, scale, and expression) are recovered from raw jittered images, with images synthesized via back-projection of straight lines in coordinate space. Each row contains images synthesized by back-projecting an axis-parallel straight line in coordinate space onto the manifold in image space. Blurry images correspond to points on the manifold whose neighborhoods contain few if any nearby data points.]

Although charting is not designed for isometry, after an affine transform the forward-projected points disagree with the original points with an RMS error of only 1.0429, lower than the best LLE (3.1423) or the best ISOMAP (1.1424, not shown). Figure 2 (right) shows the same problem where points are sampled regularly from a grid, with noise added before and after embedding. Figure 3 shows a similar treatment of a 1D line that was threaded into a 3D trefoil knot, contaminated with gaussian noise, and then “untied” via charting.

Video: We obtained a 1965-frame video sequence (courtesy S. Roweis and B. Frey) of 20 × 28-pixel images in which B.F. strikes a variety of poses and expressions. The video is heavily contaminated with synthetic camera jitters. We used raw images, though image processing could have removed this and other uninteresting sources of variation. We took a 500-frame subsequence and left-right mirrored it to obtain 1000 points in 20 × 28 = 560-dimensional image space. The point-growth process peaked just above d = 3 dimensions. We solved for 25 charts, each centered on a random point, and a 3D connection. The recovered degrees of freedom, recognizable as pose, scale, and expression, are visualized in figure 4.

[Figure 5: Flattening a fishbowl. From the left: original 2000 2D points; their stereographic mapping to a 3D fishbowl; its 2D embedding recovered using 500 charts; and the stereographic map. Fewer charts lead to isometric mappings that fold the bowl (not shown).]

Conformality: Some manifolds can be flattened conformally (preserving local angles) but not isometrically. Figure 5 shows that if the data is finely charted, the connection behaves more conformally than isometrically. This problem was suggested by J. Tenenbaum.

6 Discussion

Charting breaks kernel-based NLDR into two subproblems: (1) finding a set of data-covering locally linear neighborhoods (“charts”) such that adjoining neighborhoods span maximally similar subspaces, and (2) computing a minimal-distortion merger (“connection”) of all charts. The solution to (1) is optimal w.r.t. the estimated scale of local linearity r; the solution to (2) is optimal w.r.t. the solution to (1) and the desired dimensionality d. Both problems have Bayesian settings. By offloading the nonlinearity onto the kernels, we obtain least-squares problems and closed-form solutions. This scheme is also attractive because large eigenproblems can be avoided by using a reduced set of charts. The dependence on r, like trusted-set methods, is a potential source of solution instability.
In practice the point-growth estimate seems fairly robust to data perturbations (to be expected if the data density changes slowly over a manifold of integral Hausdorff dimension), while the use of a soft neighborhood partitioning appears to make charting solutions reasonably stable to variations in r. Eigenvalue stability analyses may prove useful here. Ultimately, we would prefer to integrate r out. In contrast, use of d appears to be a virtue: unlike other eigenvector-based methods, the best d-dimensional embedding is not merely a linear projection of the best (d+1)-dimensional embedding; a unique distortion is found for each value of d that maximizes the information content of its embedding. Why does charting perform well on datasets where the signal-to-noise ratio confounds recent state-of-the-art methods? Two reasons may be adduced: (1) Nonlocal information is used to construct both the system of local charts and their global connection. (2) The mapping only preserves the component of local point-to-point distances that projects onto the manifold; relationships perpendicular to the manifold are discarded. Thus charting uses global shape information to suppress noise in the constraints that determine the mapping.

Acknowledgments: Thanks to J. Buhmann, S. Makar, S. Roweis, J. Tenenbaum, and anonymous reviewers for insightful comments and suggested “challenge” problems.

References
[1] M. Balasubramanian and E. L. Schwartz. The IsoMap algorithm and topological stability. Science, 295(5552):7, January 2002.
[2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning. In NIPS–7, 1995.
[3] D. DeMers and G. Cottrell. Nonlinear dimensionality reduction. In NIPS–5, 1993.
[4] J. Gomes and A. Mojsilovic. A variational approach to recovering a manifold from sample points. In ECCV, 2002.
[5] T. Hastie and W. Stuetzle. Principal curves. J. Am. Statistical Assoc., 84(406):502–516, 1989.
[6] G. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of handwritten digits. IEEE Trans. Neural Networks, 8, 1997.
[7] N. Kambhatla and T. Leen. Dimensionality reduction by local principal component analysis. Neural Computation, 9, 1997.
[8] S. Roweis, L. Saul, and G. Hinton. Global coordination of linear models. In NIPS–13, 2002.
[9] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22, 2000.
[10] A. Smola, S. Mika, B. Schölkopf, and R. Williamson. Regularized principal manifolds. Machine Learning, 1999.
[11] Y. W. Teh and S. T. Roweis. Automatic alignment of hidden representations. In NIPS–15, 2003.
[12] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22, 2000.

4 0.47834814 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

Author: Nicholas P. Hughes, David Lowe

Abstract: We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SSTRESS measure) gives rise to an annular structure when the input data is drawn from a high-dimensional isotropic distribution, and we provide a theoretical justification for this observation.
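The effect is easy to probe numerically. A toy sketch (our construction; sample sizes and the optimizer are arbitrary choices) minimizes the SSTRESS loss for a 2D configuration of structureless isotropic data and inspects the radial distribution of the result, where an annular artefact shows up as radii bounded away from zero:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 20))      # isotropic, structureless 20-D data
target_sq = pdist(Y) ** 2               # squared inter-point distances

def sstress(x):
    # SSTRESS: squared error between squared distances
    d_sq = pdist(x.reshape(-1, 2)) ** 2
    return ((d_sq - target_sq) ** 2).sum()

res = minimize(sstress, rng.standard_normal(200), method='L-BFGS-B')
X = res.x.reshape(-1, 2)
r = np.linalg.norm(X - X.mean(axis=0), axis=1)
print(np.quantile(r, [0.0, 0.5, 1.0]))  # min/median/max radius of embedding
```

How pronounced the annulus is will depend on the sample size, the input dimension, and the optimizer.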

5 0.41977447 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a high-dimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low-dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user-given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules.

6 0.39151919 190 nips-2002-Stochastic Neighbor Embedding

7 0.36399364 138 nips-2002-Manifold Parzen Windows

8 0.35857931 194 nips-2002-The Decision List Machine

9 0.34769806 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement

10 0.31105259 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

11 0.2931588 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

12 0.28828871 98 nips-2002-Going Metric: Denoising Pairwise Data

13 0.28120545 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

14 0.27839318 47 nips-2002-Branching Law for Axons

15 0.27112383 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

16 0.26112333 114 nips-2002-Information Regularization with Partially Labeled Data

17 0.26103798 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

18 0.25788614 96 nips-2002-Generalized² Linear² Models

19 0.24575445 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

20 0.24546067 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.017), (23, 0.013), (42, 0.052), (54, 0.156), (55, 0.02), (64, 0.315), (67, 0.012), (68, 0.043), (74, 0.069), (86, 0.013), (90, 0.061), (92, 0.032), (98, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83709961 206 nips-2002-Visual Development Aids the Acquisition of Motion Velocity Sensitivities

Author: Robert A. Jacobs, Melissa Dominguez

Abstract: We consider the hypothesis that systems learning aspects of visual perception may benefit from the use of suitably designed developmental progressions during training. Four models were trained to estimate motion velocities in sequences of visual images. Three of the models were “developmental models” in the sense that the nature of their input changed during the course of training. They received a relatively impoverished visual input early in training, and the quality of this input improved as training progressed. One model used a coarse-to-multiscale developmental progression (i.e. it received coarse-scale motion features early in training and finer-scale features were added to its input as training progressed), another model used a fine-to-multiscale progression, and the third model used a random progression. The final model was nondevelopmental in the sense that the nature of its input remained the same throughout the training period. The simulation results show that the coarse-to-multiscale model performed best. Hypotheses are offered to account for this model’s superior performance. We conclude that suitably designed developmental sequences can be useful to systems learning to estimate motion velocities. The idea that visual development can aid visual learning is a viable hypothesis in need of further study.
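As a hypothetical illustration of the coarse-to-multiscale idea (the scale schedule and blending here are our assumptions, not the paper's), a training-input generator might blur frames heavily at first and mix finer scales back in as training progresses:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def developmental_input(frames, progress):
    """frames: (T, H, W) image sequence; progress in [0, 1].
    Early in training (progress ~ 0) only a coarse, heavily blurred version
    is seen; finer scales are added as progress grows (illustrative schedule)."""
    sigmas = [4.0, 2.0, 1.0, 0.0]                  # coarse -> fine
    n_active = 1 + int(progress * (len(sigmas) - 1))
    scales = [gaussian_filter(frames, sigma=(0, s, s)) if s > 0 else frames
              for s in sigmas[:n_active]]
    return np.mean(scales, axis=0)  # a real model might keep them as channels
```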

same-paper 2 0.7864123 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

Author: Balázs Kégl

Abstract: We propose a new algorithm to estimate the intrinsic dimension of data sets. The method is based on geometric properties of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. The method is compared to a similar, widely used algorithm from the same family of geometric techniques. Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise.
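The abstract does not spell out the algorithm, but a minimal sketch in its spirit (our code; the paper's actual procedure is more careful about orderings and scale ranges) estimates the capacity dimension from greedy r-packing numbers at two scales:

```python
import numpy as np

def packing_number(Y, r, seed=None):
    """Greedy lower bound on the r-packing number of the data set Y."""
    rng = np.random.default_rng(seed)
    centers = []
    for i in rng.permutation(len(Y)):
        if all(np.linalg.norm(Y[i] - c) > r for c in centers):
            centers.append(Y[i])
    return len(centers)

def packing_dimension(Y, r1, r2, n_trials=10):
    """Capacity-dimension estimate from packing numbers at scales r1 < r2."""
    m1 = np.mean([packing_number(Y, r1, t) for t in range(n_trials)])
    m2 = np.mean([packing_number(Y, r2, t) for t in range(n_trials)])
    return -(np.log(m2) - np.log(m1)) / (np.log(r2) - np.log(r1))
```

Averaging over random insertion orders reduces the variance of the greedy packing estimate; the returned value is the negative slope of log packing number against log scale.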

3 0.74297786 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

Author: Patrik O. Hoyer, Aapo Hyvärinen

Abstract: The responses of cortical sensory neurons are notoriously variable, with the number of spikes evoked by identical stimuli varying significantly from trial to trial. This variability is most often interpreted as ‘noise’, purely detrimental to the sensory system. In this paper, we propose an alternative view in which the variability is related to the uncertainty, about world parameters, which is inherent in the sensory stimulus. Specifically, the responses of a population of neurons are interpreted as stochastic samples from the posterior distribution in a latent variable model. In addition to giving theoretical arguments supporting such a representational scheme, we provide simulations suggesting how some aspects of response variability might be understood in this framework.

4 0.58749741 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

Author: Jarmo Hurri, Aapo Hyvärinen

Abstract: We show that two important properties of the primary visual cortex emerge when the principle of temporal coherence is applied to natural image sequences. The properties are simple-cell-like receptive fields and complex-cell-like pooling of simple cell outputs, which emerge when we apply two different approaches to temporal coherence. In the first approach we extract receptive fields whose outputs are as temporally coherent as possible. This approach yields simple-cell-like receptive fields (oriented, localized, multiscale). Thus, temporal coherence is an alternative to sparse coding in modeling the emergence of simple cell receptive fields. The second approach is based on a two-layer statistical generative model of natural image sequences. In addition to modeling the temporal coherence of individual simple cells, this model includes inter-cell temporal dependencies. Estimation of this model from natural data yields both simple-cell-like receptive fields, and complex-cell-like pooling of simple cell outputs. In this completely unsupervised learning, both layers of the generative model are estimated simultaneously from scratch. This is a significant improvement on earlier statistical models of early vision, where only one layer has been learned, and others have been fixed a priori.
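As a hypothetical sketch of the first approach (our choice of details: squared outputs as the nonlinearity, a single unit-norm filter, whitened input), one can gradient-ascend the temporal coherence of a linear filter's rectified output:

```python
import numpy as np

def temporally_coherent_filter(X, n_steps=500, lr=0.1, seed=0):
    """X: (T, D) whitened patch sequence, rows are consecutive frames.
    Maximize sum_t g(w.x_t) g(w.x_{t-1}) with g(s) = s^2 and ||w|| = 1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_steps):
        s = X @ w                              # filter output, length T
        # gradient of sum_t s_t^2 s_{t-1}^2 with respect to w
        grad = 2 * (X[1:] * (s[1:] * s[:-1] ** 2)[:, None]
                    + X[:-1] * (s[:-1] * s[1:] ** 2)[:, None]).sum(axis=0)
        w += lr * grad / len(X)
        w /= np.linalg.norm(w)                 # project back to the unit sphere
    return w
```

The exact nonlinearity, normalization, and multi-unit extension used in the paper differ; this sketch only conveys the shape of a temporal-coherence objective.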

5 0.55331886 10 nips-2002-A Model for Learning Variance Components of Natural Images

Author: Yan Karklin, Michael S. Lewicki

Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.

6 0.54797381 2 nips-2002-A Bilinear Model for Sparse Coding

7 0.54397404 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model

8 0.53254706 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

9 0.51998705 29 nips-2002-Analysis of Information in Speech Based on MANOVA

10 0.51926959 173 nips-2002-Recovering Intrinsic Images from a Single Image

11 0.51628011 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex

12 0.51483059 43 nips-2002-Binary Coding in Auditory Cortex

13 0.51315069 190 nips-2002-Stochastic Neighbor Embedding

14 0.51273566 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals

15 0.51128441 113 nips-2002-Information Diffusion Kernels

16 0.51003283 119 nips-2002-Kernel Dependency Estimation

17 0.50945711 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

18 0.50823367 53 nips-2002-Clustering with the Fisher Score

19 0.50646585 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

20 0.50616735 124 nips-2002-Learning Graphical Models with Mercer Kernels