nips nips2003 nips2003-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Denis V. Chigirev, William Bialek
Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. [sent-2, score-0.454]
2 By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. [sent-3, score-1.173]
3 With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. [sent-4, score-0.254]
4 Consider the set of points on a plane in Figure 1. [sent-6, score-0.131]
5 As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. [sent-7, score-0.294]
6 In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. [sent-11, score-0.214]
7 Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded. [sent-12, score-0.413]
8 ” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. [sent-13, score-0.267]
9 Looking for a simpler description one may choose to penalize the curvature term [1]. [sent-15, score-0.15]
10 The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. [sent-16, score-0.19]
11 If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. [sent-22, score-1.766]
12 Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). [sent-23, score-0.335]
13 We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. [sent-24, score-0.15]
14 Note that the produced curve is well defined, one dimensional and smooth. [sent-25, score-0.168]
15 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. [sent-26, score-0.77]
16 By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). [sent-27, score-0.901]
17 This is important since in high dimensional spaces all data sets are sparse, i. [sent-31, score-0.16]
18 they look like points in Figure 1, and the density estimation becomes impossible. [sent-33, score-0.196]
19 Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). [sent-34, score-0.263]
20 What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. [sent-35, score-1.008]
21 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. [sent-36, score-0.444]
22 The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. [sent-37, score-0.266]
23 Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. [sent-38, score-0.183]
24 The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. [sent-39, score-2.056]
25 They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. [sent-41, score-0.181]
26 To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. [sent-43, score-1.032]
27 What do we mean by requiring that “the manifold models the data well”? [sent-45, score-0.787]
28 In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). [sent-46, score-1.774]
29 Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). [sent-47, score-0.895]
30 Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. [sent-49, score-0.216]
31 By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. [sent-50, score-0.238]
32 They introduce nonlinearity through an a priori nonlinear kernel. [sent-52, score-0.145]
33 Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. [sent-53, score-0.189]
34 Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. [sent-55, score-0.305]
35 The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. [sent-59, score-0.288]
36 Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. [sent-61, score-1.162]
37 This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. [sent-62, score-1.049]
38 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). [sent-63, score-0.344]
39 One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. [sent-65, score-0.861]
40 If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. [sent-66, score-1.786]
41 Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. [sent-67, score-0.972]
42 The manifold description (M, PM ) is a less than faithful representation of the data. [sent-69, score-0.883]
43 To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. [sent-70, score-0.21]
44 The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. [sent-72, score-0.839]
45 If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. [sent-74, score-0.958]
46 Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i. [sent-75, score-1.11]
47 The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. [sent-78, score-0.903]
48 So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. [sent-79, score-0.854]
49 Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). [sent-82, score-1.871]
50 Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. [sent-83, score-0.925]
51 The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). [sent-84, score-1.127]
52 This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. [sent-85, score-0.89]
53 To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. [sent-86, score-0.826]
54 Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. [sent-88, score-0.783]
55 (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). [sent-89, score-0.744]
56 The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. [sent-90, score-0.894]
57 Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. [sent-91, score-0.398]
58 (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. [sent-92, score-1.013]
59 e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). [sent-110, score-0.312]
60 The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. [sent-111, score-0.909]
61 The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. [sent-112, score-0.142]
62 If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. [sent-113, score-0.898]
63 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. [sent-114, score-1.022]
64 Presumably, the manifold M has a well defined intrinsic dimensionality d. [sent-115, score-0.958]
65 If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . [sent-116, score-2.032]
66 On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . [sent-117, score-0.149]
67 This measures the probability that any two points fall within the ball of radius r. [sent-120, score-0.235]
68 Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. [sent-121, score-0.864]
69 (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. [sent-122, score-0.997]
70 (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). [sent-123, score-0.966]
71 the correlation dimension at length scale r as the slope on the log log plot. [sent-124, score-0.173]
72 d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. [sent-126, score-1.301]
73 Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. [sent-127, score-0.186]
74 1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). [sent-129, score-0.271]
75 To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). [sent-133, score-1.0]
76 As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. [sent-134, score-1.076]
77 One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. [sent-135, score-1.023]
78 The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. [sent-140, score-1.083]
79 (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. [sent-169, score-0.164]
80 (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. [sent-172, score-0.926]
81 a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. [sent-173, score-0.898]
82 Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). [sent-174, score-0.833]
83 2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. [sent-176, score-0.262]
84 The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). [sent-177, score-0.864]
85 After running our algorithm we have recovered the original structure of the manifold (Figure 3c). [sent-178, score-0.794]
86 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. [sent-179, score-0.53]
87 Rather than constraining geometrical features of the manifold (e. [sent-180, score-0.777]
88 , the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. [sent-182, score-0.915]
89 This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. [sent-183, score-0.744]
90 Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. [sent-184, score-0.909]
91 For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. [sent-185, score-0.32]
92 (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior. [sent-187, score-0.196]
93 The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. [sent-190, score-0.843]
94 In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). [sent-194, score-0.467]
95 I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. [sent-197, score-0.783]
96 In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. [sent-198, score-0.276]
97 The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. [sent-202, score-0.864]
98 (2000) A global geometric framework for nonlinear dimensionality reduction. [sent-219, score-0.219]
99 (2003) Laplacian eigenmaps for dimensionality reduction and data representation. [sent-243, score-0.287]
100 (1972) Computation of channel capacity and rate distortion function. [sent-246, score-0.253]
wordName wordTfidf (topN-words)
[('manifold', 0.744), ('pm', 0.278), ('pk', 0.19), ('dd', 0.167), ('dimensionality', 0.164), ('distortion', 0.142), ('dimensional', 0.117), ('latent', 0.106), ('points', 0.099), ('rd', 0.095), ('description', 0.077), ('compression', 0.076), ('sheet', 0.065), ('bottleneck', 0.063), ('density', 0.06), ('map', 0.057), ('scattered', 0.056), ('nonlinear', 0.055), ('xi', 0.055), ('coordinates', 0.054), ('onto', 0.051), ('curve', 0.051), ('bits', 0.05), ('radius', 0.05), ('intrinsic', 0.05), ('autoassociative', 0.05), ('dcorr', 0.05), ('doublet', 0.05), ('hotelling', 0.05), ('hypersurface', 0.05), ('original', 0.05), ('lle', 0.05), ('slope', 0.049), ('fall', 0.048), ('trade', 0.048), ('reduction', 0.048), ('formalize', 0.046), ('generative', 0.045), ('priori', 0.044), ('curvature', 0.044), ('kramer', 0.044), ('data', 0.043), ('intuition', 0.043), ('theoretic', 0.042), ('dimensions', 0.042), ('channel', 0.042), ('capacity', 0.04), ('optimal', 0.039), ('resolution', 0.038), ('ball', 0.038), ('gtm', 0.037), ('becomes', 0.037), ('projection', 0.037), ('correlation', 0.036), ('xy', 0.035), ('faithful', 0.035), ('equations', 0.034), ('reduced', 0.033), ('shape', 0.033), ('constraining', 0.033), ('terminated', 0.033), ('xd', 0.033), ('isomap', 0.033), ('precision', 0.033), ('embedding', 0.033), ('plane', 0.032), ('willing', 0.032), ('eigenmaps', 0.032), ('principal', 0.031), ('emerges', 0.03), ('modeling', 0.03), ('log', 0.03), ('alternatively', 0.03), ('penalize', 0.029), ('manifolds', 0.029), ('tells', 0.029), ('implicitly', 0.029), ('rate', 0.029), ('princeton', 0.028), ('correlations', 0.028), ('low', 0.028), ('dimension', 0.028), ('coordinate', 0.028), ('hastie', 0.027), ('presumably', 0.027), ('representation', 0.027), ('solution', 0.026), ('quickly', 0.025), ('positions', 0.025), ('variable', 0.025), ('laplacian', 0.025), ('hyperplane', 0.024), ('nonlinearity', 0.024), ('smoothing', 0.024), ('letting', 0.024), ('tools', 0.024), ('whole', 0.023), ('normal', 0.023), ('introduce', 0.022), ('stochastic', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
Author: Denis V. Chigirev, William Bialek
Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.
2 0.17269599 120 nips-2003-Locality Preserving Projections
Author: Xiaofei He, Partha Niyogi
Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.
3 0.15380467 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis
Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1
4 0.14710447 128 nips-2003-Minimax Embeddings
Author: Matthew Brand
Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1 Nonlinear dimensionality reduction (NLDR) . Spectral NLDR methods are graph embedding problems where a set of N points X = [x1 , · · · , xN ] ∈ RD×N sampled from a low-dimensional manifold in a ambient space RD is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a “parameterization” space Rd , d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings RN . Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in Rd parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between RD and Rd . This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability. 2 Minimax theorem for graph embeddings We begin with neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ RN×N that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). Typically Mi j = 0 if points i, j are nearby on the intrinsic manifold and |Mi j | is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual YM − Y F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus Mi j = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding : 1. A null-space restriction, where the solution must be outside to the column-space of C ∈ RN×M , M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector. 2. A basis restriction, where the solution must be a linear combination of the rows of basis Z ∈ RK×N , K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks. 3. A metric Σ ∈ RN×N that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that Σ is symmetric positive definite and has factorization Σ = AA (e.g., A could be a Cholesky factor of Σ). In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector yi ∈ rows(Y) to be . EM (yi ) = max yi ∈range(Z),, K∈RM×N (yi (M + CD) − yi )A yi A (1) where D is a matrix constructed by an adversary to maximize the error. The optimizing yi is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual yi M−yi has smallest norm w.r.t. the metric Σ. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, Σ: Minimax solution: Let Q ∈ SK×P be a column-orthonormal basis of the null-space of the rows of ZC, with P = K − rank(C). Let B ∈ RP×P be a square factor satisfying B B = Q ZΣZ Q, e.g., a Cholesky factor (or the “R” factor in QR-decomposition of (Q ZA) ). Compute the left singular vectors U ∈ SN×N of Udiag(s)V = B− Q Z(I − M)A, with . singular values s = [s1 , · · · , sP ] ordered s1 ≤ s2 ≤ · · · ≤ s p . Using the leading columns U1:d of U, set Y = U1:d B− Q Z. Theorem 1. Y is the optimal (minimax) embedding in Rd with error [s1 , · · · , sd ] 2 : . Y = U1:d B− Q Z = arg min ∑ EM (yi )2 with EM (yi ) = si . Y∈Rd×N y ∈rows(Y) i (2) Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M 1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further Z = X we recover locally preserving projections [8]. 3 Analysis and generalization of charting The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S1 ∈ RD×d , S2 , · · · at offsets µ1 ∈ RD , µ2 , · · · that best cover the data and vary smoothly over the manifold. Each subspace offers a chart—a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding. Under the assumption of isometry, the charting error is equivalent to the sumsquared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criteria in the minimax setting, let xi−k , · · · , xi , · · · , xi+k denote points in the ith neighborhood and let the columns of Vi ∈ R(2k+1)×d be an orthonormal basis of rows of the local parameterization Si [xi−k , · · · , xi , · · · , xi+k ]. Then a nonzero reparameterization will satisfy [yi−k , · · · , yi , · · · , yi+k ]Vi Vi = [yi−k , · · · , yi , · · · , yi+k ] if and only if it preserves the relative position of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula [yi−k , · · · , yi , · · · , yi+k ](I − Vi Vi ). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − ∑i Fi (I − Vi Vi )Fi , where (Fi )k j = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because Vi Vi and (I − Vi Vi ) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − Vi Vi ). To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x1 ), z(x2 ), · · · z(xN )], where z(x) is a vector that stacks z1 (x), z2 (x), etc., and pm (x) . Km (x − µm ) , zm (x) = 1 ∑m pm (x) −1 . pm (x) = N (x|µm , Σm ) ∝ e−(x−µm ) Σm (x−µm )/2 (3) (4) and Km is any local linear dimensionality reducer, typically Sm itself. Each column of Z contains many “views” of the same point that are combined to give its low-dimensional embedding. Finally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing ma. trix, f (x) = U1:d B− Q z(x). Theorem 1 guarantees that the resulting embedding is leastsquares optimal w.r.t. Z, M, C, A at the datapoints f (xi ), and because f (·) is an affine transform of z(·) it smoothly interpolates the embedding between points. There are some interesting variants: Kernel embeddings of the twisted swiss roll generalized EVD minimax SVD UR corner detail LL corner detail Fig. 1. Minimax and generalized EVD solution for kernel eigenmap of a non-developable swiss roll. Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints. One-shot charting: If we set the local dimensionality reducers to the identity matrix (all Km = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution. Discrete isometric charting: If Z = I then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap. Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) N. Then the size of the SVD problem is substantially reduced. 4 Numerical advantage of minimax method Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R0 embedding where all points are mapped to a single point yi = N −1/2 . The R0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as null eigenvalue paired to a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasing amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF. A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took −5 excess energy x 10 Eigen spectrum compared to minimax spectrum 15 10 5 0 −5 Eigen spectrum compared to minimax spectrum 2 15 deviation excess energy x 10 10 5 100 200 eigenvalue Error in null embedding −5 x 10 0 −2 −4 −6 −8 0 100 −5 eigenvalue Error in null embedding 200 100 200 300 400 500 point 600 700 800 900 Fig. 2. Excess energy in the eigenspectrum indicates that the translational DOF has contam2 inated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its 0 remaining eigenvalues should be identical to those derived from the minimax solution. The −2 −4 graph at left shows the difference in the eigenspectra. The graph at right shows the EVD −6 solution’s deviation from the translational vector y0 = 1 · N −1/2 ≈ .03333. If the numer−8 ics were100 200 the line would be flat, but in practice the deviation is significant enough perfect 300 400 500 600 700 800 900 point (roughly 1% of the diameter of the embedding) to noticably perturb points in figure 1. deviation x 10 the best result. Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors. 4.1 Numerical instability of eigen-methods The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can “filter” the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition. Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v0 can grow as O(N 3/2 ), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N 3/2 ), at which point the eigenvectors of interest—embedding and translational—are mutually contaminated and/or have an indeterminate eigenvalue ordering. Please see appendix B for the proof. This theorem essentially lower-bounds an upperbound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve xi = [a, cos πa] , a ∈ [0, 1] with K = 2 neighbors, instabilities cannot be bounded better than O(N 5/2 ); empirically we see eigenvector mixing with N < 100 points and we see it grow at the rate ≈ O(N 4 )—in many different eigensolvers. At very large scales, more pernicious instabilities set in. E.g., by N = 20000 points, the solution begins to fold over. Although algebraic multiplicity and instability of the eigenproblem is conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal. 5 Summary One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analysis rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning—notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization ( SVD ). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques. References 1. W.T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743–768, 1963. 2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975. 3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997. 4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000. 5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. 6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003. 7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. volume 14 of Advances in Neural Information Processing Systems, 2002. 8. Xiafei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002. 9. Matthew Brand. Charting a manifold. volume 15 of Advances in Neural Information Processing Systems, 2003. 10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990. A Proof of minimax embedding theorem (1) The burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow. Proof. Setting yi = li Z, we will solve for li ∈ columns(L). Writing the error in terms of li , EM (li ) = max K∈RM×N li Z(I − M − CK)A li Z(I − M)A − li ZCKA = max . M×N li ZA li ZA K∈R (5) The term li ZCKA produces infinite error unless li ZC = 0, so we accept this as a constraint and seek li Z(I − M)A min . (6) li ZA li ZC=0 By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal . to ZC; the basis for this space is given by columns of Q = null((ZC) ). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA Z ; when the problem is projected into the space orthogonal to ZC it becomes Q (ZAA Z )Q. Nesting the “orthogonally-constrained-SVD” construction of lemma 1 inside the “SVD-under-a-metric” lemma 2, we obtain a solution that uses the correct metric in the orthogonal space: B B = Q ZAA Z Q − Udiag(s)V = B (7) {Q(Z(I − M)A)} (8) L = QB−1 U (9) where braces indicate the nesting of lemmas. By the “best-projection” lemma (#3), if we order the singular values by ascending magnitude, L1:d = arg min J∈RN×d ∑ji ∈cols(J) ( j Z(I − M)A / j )2 ZΣZ The proof is completed by making the substitutions L Z → Y and x A → x Σ = AA ), and leaving off the final square root operation to obtain (Y )1:d = arg min ∑ji ∈cols(J) j (I − M) Σ / j J∈RN×d (10) Σ (for 2 Σ . (11) Lemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under . SVD the constraint U C = 0 are calculated as Q = null(C ), Udiag(s)V ← Q M, L = QU. Proof. First observe that L is orthogonal to C: By definition, the null-space basis satisfies Q C = 0, thus L C = U Q C = 0. Let J be an orthonormal basis for C, with J J = I and Q J = 0. Then Ldiag(s)V = QQ M = (I − JJ )M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C. Lemma 2. SVD with respect to a metric: The vectors li ∈ L, vi ∈ V that diagonalize matrix M with respect to positive definite column-space metric Σ are calculated as B B ← Σ, SVD . Udiag(s)V ← B− M, L = B−1 U satisfy li M / li Σ = si and extremize this form for the extremal singular values smin , smax . Proof. By construction, L and V diagonalize M: L MV = (B−1 U) MV = U (B− M)V = diag(s) (12) B− and diag(s)V = M. Forming the gram matrices of both sides of the last line, we obtain the identity Vdiag(s)2 V = M B−1 B− M = M Σ−1 M, which demonstrates that si ∈ s are the singular values of M w.r.t. column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because L 2 = L ΣL = U B− B BB−1 U = I. Consequently, Σ l M / l Σ = l M /1 = si vi and by the Courant-Hilbert theorem, smax = max l M / l Σ ; = si . smin = min l M / l Σ . l l (13) (14) Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and elements of s be sorted so that s1 ≥ s2 ≥ · · · ≥ sN . Then for any dimensionality 1 ≤ d ≤ N, . L1:d = [l1 , · · · , ld ] = arg max J M (J ΣJ)−1 (15) J∈RN×d = arg max F (16) ∑ji ∈cols(J) ( j M / j Σ )2 (17) J∈RN×d |J ΣJ=I = arg max J∈RN×d J M with the optimum value of all right hand sides being (∑d s2 )1/2 . If the sort order is rei=1 i versed, the minimum of this form is obtained. Proof. By the Eckart-Young-Mirsky theorem, if U MV = diag(s) with singular values . sorted in descending order, then U1:d = [u1 , · · · , ud ] = arg maxU∈SN×d U M F . We first extend this to a non-orthonogonal basis J under a Mahalonobis norm: maxJ∈RN×d J M (J J)−1 = maxU∈SN×d U M F (18) because J M 2 (J J)−1 = trace(M J(J J)−1 J M) = trace(M JJ+ (JJ+ ) M) = (JJ+ )M 2 = UU M 2 = U M 2 since JJ+ is a (symmetric) orthogonal proF F F jector having binary eigenvalues λ ∈ {0, 1} and therefore it is the gram of an thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J M while minimizing the norm of J w.r.t. metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding, L1:d M 2 ΣL )−1 = trace((L1:d M) (L1:d ΣL1:d )−1 (L1:d M)) = (L 1:d 1:d trace((L1:d M) I(L1:d M)) = trace((diag(s1:d )V1:d ) (diag(s1:d )V1:d )) = s1:d 2 . Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 criterion merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that leading N − d singular values account for the maximal part of the variance. B Proof of instability theorem (2) Proof. When generated from a sparse graph with average degree K, weighted connectivity matrix W is sparse and has O(NK) entries. Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W + E due to numerical noise E created by finite machine precision. By the weak law of large √ numbers, the Frobenius norm of the sparse perturbation grows as E F ∼ O( N). However the t th -smallest nonzero eigenvalue λt (W) grows as λt (W) = vt Wvt ∼ O(N −1 ), because elements of corresponding eigenvector vt grow as O(N −1/2 ) and only K of those elements are multiplied by nonzero values to form each element of Wvt . In sum, the perturbation E F grows while the eigenvalue λt (W) shrinks. In linear embedding algorithms, . the eigengap of interest is λgap = λ1 − λ0 . The tail eigenvalue λ0 = 0 by construction but it is possible that λ0 > 0 with numerical error, thus λgap ≤ λ1 . Combining these facts, the ratio between the perturbation and the eigengap grows as E F /λgap ∼ O(N 3/2 ) or faster. Now consider the shifted eigenproblem I − W with leading (maximal) eigenvalues 1 − λ0 ≥ 1 − λ1 ≥ · · · and unchanged eigenvectors. From matrix perturbation the. ory [10, thm. V.2.8], when W is perturbed to W = W + E, the change in the lead√ ing eigenvalue from 1 − λ0 to 1 − λ0 is bounded as |λ0 − λ0 | ≤ 2 E F and similarly √ √ 1 − λ1 ≤ 1 − λ1 + 2 E F . Thus λgap ≥ λgap − 2 E F . Since E F /λgap ∼ O(N 3/2 ), the right hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable with the possibility of the first and second eigenvalue/vector pairs being swapped. Mutual contamination of the eigenvectors happens well before: Under general (dense) conditions, the change in the eigenvector v0 is bounded E F as v0 − v0 ≤ |λ −λ4 |−√2 E [10, thm. V.2.8]. (This bound is often tight enough to serve F 0 1 as a good approximation.) Specializing this to the sparse embedding matrix, we find that √ √ O( N) O( N) the bound weakens to v0 − 1 · N −1/2 ∼ O(N −1 )−O(√N) > O(N −1 ) = O(N 3/2 ).
5 0.13568929 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.13019727 88 nips-2003-Image Reconstruction by Linear Programming
7 0.11761042 164 nips-2003-Ranking on Data Manifolds
8 0.11583064 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
9 0.11394349 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
10 0.10301705 92 nips-2003-Information Bottleneck for Gaussian Variables
11 0.10094981 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
12 0.098390371 81 nips-2003-Geometric Analysis of Constrained Curves
13 0.078736283 51 nips-2003-Design of Experiments via Information Theory
14 0.068000704 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
15 0.064560674 172 nips-2003-Semi-Supervised Learning with Trees
16 0.062492691 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
17 0.058490444 112 nips-2003-Learning to Find Pre-Images
18 0.056828626 111 nips-2003-Learning the k in k-means
19 0.05618494 30 nips-2003-Approximability of Probability Distributions
20 0.054498926 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
topicId topicWeight
[(0, -0.2), (1, -0.102), (2, -0.018), (3, 0.027), (4, -0.086), (5, 0.16), (6, -0.043), (7, -0.037), (8, 0.064), (9, -0.244), (10, 0.002), (11, -0.023), (12, 0.206), (13, 0.054), (14, 0.009), (15, -0.137), (16, 0.029), (17, 0.062), (18, 0.143), (19, 0.074), (20, -0.033), (21, 0.097), (22, 0.0), (23, 0.024), (24, 0.134), (25, 0.03), (26, 0.033), (27, -0.076), (28, 0.051), (29, 0.074), (30, -0.071), (31, 0.059), (32, 0.11), (33, 0.077), (34, 0.165), (35, 0.031), (36, 0.032), (37, -0.025), (38, -0.01), (39, -0.147), (40, -0.021), (41, -0.011), (42, 0.044), (43, -0.033), (44, -0.041), (45, -0.152), (46, -0.141), (47, -0.181), (48, 0.019), (49, -0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.97225231 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
Author: Denis V. Chigirev, William Bialek
Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.
2 0.75984961 120 nips-2003-Locality Preserving Projections
Author: Xiaofei He, Partha Niyogi
Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.
3 0.64381337 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis
Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1
4 0.5094254 128 nips-2003-Minimax Embeddings
Author: Matthew Brand
Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1 Nonlinear dimensionality reduction (NLDR) . Spectral NLDR methods are graph embedding problems where a set of N points X = [x1 , · · · , xN ] ∈ RD×N sampled from a low-dimensional manifold in a ambient space RD is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a “parameterization” space Rd , d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings RN . Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in Rd parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between RD and Rd . This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability. 2 Minimax theorem for graph embeddings We begin with neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ RN×N that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). Typically Mi j = 0 if points i, j are nearby on the intrinsic manifold and |Mi j | is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual YM − Y F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus Mi j = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding : 1. A null-space restriction, where the solution must be outside to the column-space of C ∈ RN×M , M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector. 2. A basis restriction, where the solution must be a linear combination of the rows of basis Z ∈ RK×N , K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks. 3. A metric Σ ∈ RN×N that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that Σ is symmetric positive definite and has factorization Σ = AA (e.g., A could be a Cholesky factor of Σ). In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector yi ∈ rows(Y) to be . EM (yi ) = max yi ∈range(Z),, K∈RM×N (yi (M + CD) − yi )A yi A (1) where D is a matrix constructed by an adversary to maximize the error. The optimizing yi is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual yi M−yi has smallest norm w.r.t. the metric Σ. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, Σ: Minimax solution: Let Q ∈ SK×P be a column-orthonormal basis of the null-space of the rows of ZC, with P = K − rank(C). Let B ∈ RP×P be a square factor satisfying B B = Q ZΣZ Q, e.g., a Cholesky factor (or the “R” factor in QR-decomposition of (Q ZA) ). Compute the left singular vectors U ∈ SN×N of Udiag(s)V = B− Q Z(I − M)A, with . singular values s = [s1 , · · · , sP ] ordered s1 ≤ s2 ≤ · · · ≤ s p . Using the leading columns U1:d of U, set Y = U1:d B− Q Z. Theorem 1. Y is the optimal (minimax) embedding in Rd with error [s1 , · · · , sd ] 2 : . Y = U1:d B− Q Z = arg min ∑ EM (yi )2 with EM (yi ) = si . Y∈Rd×N y ∈rows(Y) i (2) Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M 1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further Z = X we recover locally preserving projections [8]. 3 Analysis and generalization of charting The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S1 ∈ RD×d , S2 , · · · at offsets µ1 ∈ RD , µ2 , · · · that best cover the data and vary smoothly over the manifold. Each subspace offers a chart—a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding. Under the assumption of isometry, the charting error is equivalent to the sumsquared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criteria in the minimax setting, let xi−k , · · · , xi , · · · , xi+k denote points in the ith neighborhood and let the columns of Vi ∈ R(2k+1)×d be an orthonormal basis of rows of the local parameterization Si [xi−k , · · · , xi , · · · , xi+k ]. Then a nonzero reparameterization will satisfy [yi−k , · · · , yi , · · · , yi+k ]Vi Vi = [yi−k , · · · , yi , · · · , yi+k ] if and only if it preserves the relative position of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula [yi−k , · · · , yi , · · · , yi+k ](I − Vi Vi ). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − ∑i Fi (I − Vi Vi )Fi , where (Fi )k j = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because Vi Vi and (I − Vi Vi ) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − Vi Vi ). To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x1 ), z(x2 ), · · · z(xN )], where z(x) is a vector that stacks z1 (x), z2 (x), etc., and pm (x) . Km (x − µm ) , zm (x) = 1 ∑m pm (x) −1 . pm (x) = N (x|µm , Σm ) ∝ e−(x−µm ) Σm (x−µm )/2 (3) (4) and Km is any local linear dimensionality reducer, typically Sm itself. Each column of Z contains many “views” of the same point that are combined to give its low-dimensional embedding. Finally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing ma. trix, f (x) = U1:d B− Q z(x). Theorem 1 guarantees that the resulting embedding is leastsquares optimal w.r.t. Z, M, C, A at the datapoints f (xi ), and because f (·) is an affine transform of z(·) it smoothly interpolates the embedding between points. There are some interesting variants: Kernel embeddings of the twisted swiss roll generalized EVD minimax SVD UR corner detail LL corner detail Fig. 1. Minimax and generalized EVD solution for kernel eigenmap of a non-developable swiss roll. Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints. One-shot charting: If we set the local dimensionality reducers to the identity matrix (all Km = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution. Discrete isometric charting: If Z = I then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap. Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) N. Then the size of the SVD problem is substantially reduced. 4 Numerical advantage of minimax method Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R0 embedding where all points are mapped to a single point yi = N −1/2 . The R0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as null eigenvalue paired to a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasing amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF. A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took −5 excess energy x 10 Eigen spectrum compared to minimax spectrum 15 10 5 0 −5 Eigen spectrum compared to minimax spectrum 2 15 deviation excess energy x 10 10 5 100 200 eigenvalue Error in null embedding −5 x 10 0 −2 −4 −6 −8 0 100 −5 eigenvalue Error in null embedding 200 100 200 300 400 500 point 600 700 800 900 Fig. 2. Excess energy in the eigenspectrum indicates that the translational DOF has contam2 inated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its 0 remaining eigenvalues should be identical to those derived from the minimax solution. The −2 −4 graph at left shows the difference in the eigenspectra. The graph at right shows the EVD −6 solution’s deviation from the translational vector y0 = 1 · N −1/2 ≈ .03333. If the numer−8 ics were100 200 the line would be flat, but in practice the deviation is significant enough perfect 300 400 500 600 700 800 900 point (roughly 1% of the diameter of the embedding) to noticably perturb points in figure 1. deviation x 10 the best result. Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors. 4.1 Numerical instability of eigen-methods The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can “filter” the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition. Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v0 can grow as O(N 3/2 ), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N 3/2 ), at which point the eigenvectors of interest—embedding and translational—are mutually contaminated and/or have an indeterminate eigenvalue ordering. Please see appendix B for the proof. This theorem essentially lower-bounds an upperbound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve xi = [a, cos πa] , a ∈ [0, 1] with K = 2 neighbors, instabilities cannot be bounded better than O(N 5/2 ); empirically we see eigenvector mixing with N < 100 points and we see it grow at the rate ≈ O(N 4 )—in many different eigensolvers. At very large scales, more pernicious instabilities set in. E.g., by N = 20000 points, the solution begins to fold over. Although algebraic multiplicity and instability of the eigenproblem is conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal. 5 Summary One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analysis rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning—notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization ( SVD ). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques. References 1. W.T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743–768, 1963. 2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975. 3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997. 4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000. 5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. 6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003. 7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. volume 14 of Advances in Neural Information Processing Systems, 2002. 8. Xiafei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002. 9. Matthew Brand. Charting a manifold. volume 15 of Advances in Neural Information Processing Systems, 2003. 10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990. A Proof of minimax embedding theorem (1) The burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow. Proof. Setting yi = li Z, we will solve for li ∈ columns(L). Writing the error in terms of li , EM (li ) = max K∈RM×N li Z(I − M − CK)A li Z(I − M)A − li ZCKA = max . M×N li ZA li ZA K∈R (5) The term li ZCKA produces infinite error unless li ZC = 0, so we accept this as a constraint and seek li Z(I − M)A min . (6) li ZA li ZC=0 By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal . to ZC; the basis for this space is given by columns of Q = null((ZC) ). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA Z ; when the problem is projected into the space orthogonal to ZC it becomes Q (ZAA Z )Q. Nesting the “orthogonally-constrained-SVD” construction of lemma 1 inside the “SVD-under-a-metric” lemma 2, we obtain a solution that uses the correct metric in the orthogonal space: B B = Q ZAA Z Q − Udiag(s)V = B (7) {Q(Z(I − M)A)} (8) L = QB−1 U (9) where braces indicate the nesting of lemmas. By the “best-projection” lemma (#3), if we order the singular values by ascending magnitude, L1:d = arg min J∈RN×d ∑ji ∈cols(J) ( j Z(I − M)A / j )2 ZΣZ The proof is completed by making the substitutions L Z → Y and x A → x Σ = AA ), and leaving off the final square root operation to obtain (Y )1:d = arg min ∑ji ∈cols(J) j (I − M) Σ / j J∈RN×d (10) Σ (for 2 Σ . (11) Lemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under . SVD the constraint U C = 0 are calculated as Q = null(C ), Udiag(s)V ← Q M, L = QU. Proof. First observe that L is orthogonal to C: By definition, the null-space basis satisfies Q C = 0, thus L C = U Q C = 0. Let J be an orthonormal basis for C, with J J = I and Q J = 0. Then Ldiag(s)V = QQ M = (I − JJ )M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C. Lemma 2. SVD with respect to a metric: The vectors li ∈ L, vi ∈ V that diagonalize matrix M with respect to positive definite column-space metric Σ are calculated as B B ← Σ, SVD . Udiag(s)V ← B− M, L = B−1 U satisfy li M / li Σ = si and extremize this form for the extremal singular values smin , smax . Proof. By construction, L and V diagonalize M: L MV = (B−1 U) MV = U (B− M)V = diag(s) (12) B− and diag(s)V = M. Forming the gram matrices of both sides of the last line, we obtain the identity Vdiag(s)2 V = M B−1 B− M = M Σ−1 M, which demonstrates that si ∈ s are the singular values of M w.r.t. column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because L 2 = L ΣL = U B− B BB−1 U = I. Consequently, Σ l M / l Σ = l M /1 = si vi and by the Courant-Hilbert theorem, smax = max l M / l Σ ; = si . smin = min l M / l Σ . l l (13) (14) Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and elements of s be sorted so that s1 ≥ s2 ≥ · · · ≥ sN . Then for any dimensionality 1 ≤ d ≤ N, . L1:d = [l1 , · · · , ld ] = arg max J M (J ΣJ)−1 (15) J∈RN×d = arg max F (16) ∑ji ∈cols(J) ( j M / j Σ )2 (17) J∈RN×d |J ΣJ=I = arg max J∈RN×d J M with the optimum value of all right hand sides being (∑d s2 )1/2 . If the sort order is rei=1 i versed, the minimum of this form is obtained. Proof. By the Eckart-Young-Mirsky theorem, if U MV = diag(s) with singular values . sorted in descending order, then U1:d = [u1 , · · · , ud ] = arg maxU∈SN×d U M F . We first extend this to a non-orthonogonal basis J under a Mahalonobis norm: maxJ∈RN×d J M (J J)−1 = maxU∈SN×d U M F (18) because J M 2 (J J)−1 = trace(M J(J J)−1 J M) = trace(M JJ+ (JJ+ ) M) = (JJ+ )M 2 = UU M 2 = U M 2 since JJ+ is a (symmetric) orthogonal proF F F jector having binary eigenvalues λ ∈ {0, 1} and therefore it is the gram of an thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J M while minimizing the norm of J w.r.t. metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding, L1:d M 2 ΣL )−1 = trace((L1:d M) (L1:d ΣL1:d )−1 (L1:d M)) = (L 1:d 1:d trace((L1:d M) I(L1:d M)) = trace((diag(s1:d )V1:d ) (diag(s1:d )V1:d )) = s1:d 2 . Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 criterion merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that leading N − d singular values account for the maximal part of the variance. B Proof of instability theorem (2) Proof. When generated from a sparse graph with average degree K, weighted connectivity matrix W is sparse and has O(NK) entries. Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W + E due to numerical noise E created by finite machine precision. By the weak law of large √ numbers, the Frobenius norm of the sparse perturbation grows as E F ∼ O( N). However the t th -smallest nonzero eigenvalue λt (W) grows as λt (W) = vt Wvt ∼ O(N −1 ), because elements of corresponding eigenvector vt grow as O(N −1/2 ) and only K of those elements are multiplied by nonzero values to form each element of Wvt . In sum, the perturbation E F grows while the eigenvalue λt (W) shrinks. In linear embedding algorithms, . the eigengap of interest is λgap = λ1 − λ0 . The tail eigenvalue λ0 = 0 by construction but it is possible that λ0 > 0 with numerical error, thus λgap ≤ λ1 . Combining these facts, the ratio between the perturbation and the eigengap grows as E F /λgap ∼ O(N 3/2 ) or faster. Now consider the shifted eigenproblem I − W with leading (maximal) eigenvalues 1 − λ0 ≥ 1 − λ1 ≥ · · · and unchanged eigenvectors. From matrix perturbation the. ory [10, thm. V.2.8], when W is perturbed to W = W + E, the change in the lead√ ing eigenvalue from 1 − λ0 to 1 − λ0 is bounded as |λ0 − λ0 | ≤ 2 E F and similarly √ √ 1 − λ1 ≤ 1 − λ1 + 2 E F . Thus λgap ≥ λgap − 2 E F . Since E F /λgap ∼ O(N 3/2 ), the right hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable with the possibility of the first and second eigenvalue/vector pairs being swapped. Mutual contamination of the eigenvectors happens well before: Under general (dense) conditions, the change in the eigenvector v0 is bounded E F as v0 − v0 ≤ |λ −λ4 |−√2 E [10, thm. V.2.8]. (This bound is often tight enough to serve F 0 1 as a good approximation.) Specializing this to the sparse embedding matrix, we find that √ √ O( N) O( N) the bound weakens to v0 − 1 · N −1/2 ∼ O(N −1 )−O(√N) > O(N −1 ) = O(N 3/2 ).
5 0.49896595 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.47164103 92 nips-2003-Information Bottleneck for Gaussian Variables
7 0.45494735 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
8 0.42407563 88 nips-2003-Image Reconstruction by Linear Programming
9 0.41551229 172 nips-2003-Semi-Supervised Learning with Trees
10 0.41035601 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
11 0.37383178 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
12 0.36920401 81 nips-2003-Geometric Analysis of Constrained Curves
13 0.33835527 164 nips-2003-Ranking on Data Manifolds
14 0.31846941 51 nips-2003-Design of Experiments via Information Theory
15 0.31225616 30 nips-2003-Approximability of Probability Distributions
16 0.31156746 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
17 0.28331617 113 nips-2003-Learning with Local and Global Consistency
18 0.27652347 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors
19 0.25841838 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
20 0.25621563 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
topicId topicWeight
[(0, 0.059), (11, 0.031), (29, 0.033), (30, 0.039), (35, 0.044), (53, 0.187), (69, 0.022), (71, 0.058), (76, 0.049), (78, 0.023), (85, 0.065), (91, 0.075), (96, 0.204), (99, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.8743493 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
Author: Denis V. Chigirev, William Bialek
Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. [2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.
2 0.73259288 66 nips-2003-Extreme Components Analysis
Author: Max Welling, Christopher Williams, Felix V. Agakov
Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.
3 0.72954601 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation
Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki
Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K−means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1 −norm is unique with probability of one, which can be obtained using a linear programming algorithm. The comparisons of the l1 −norm solution and the l0 −norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data using the wavelet packets transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1
4 0.72827643 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
Author: Amos J. Storkey
Abstract: Discrete Fourier transforms and other related Fourier methods have been practically implementable due to the fast Fourier transform (FFT). However there are many situations where doing fast Fourier transforms without complete data would be desirable. In this paper it is recognised that formulating the FFT algorithm as a belief network allows suitable priors to be set for the Fourier coefficients. Furthermore efficient generalised belief propagation methods between clusters of four nodes enable the Fourier coefficients to be inferred and the missing data to be estimated in near to O(n log n) time, where n is the total of the given and missing data points. This method is compared with a number of common approaches such as setting missing data to zero or to interpolation. It is tested on generated data and for a Fourier analysis of a damaged audio signal. 1
5 0.72280312 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
Author: Susanne Still, William Bialek, Léon Bottou
Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1
6 0.72179657 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
7 0.71925116 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
8 0.71750987 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
9 0.71469975 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class
10 0.71212119 126 nips-2003-Measure Based Regularization
11 0.70855498 120 nips-2003-Locality Preserving Projections
12 0.70837021 113 nips-2003-Learning with Local and Global Consistency
13 0.70702696 107 nips-2003-Learning Spectral Clustering
14 0.70638072 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
15 0.70456499 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
16 0.7035675 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
17 0.70297664 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
18 0.70215714 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing
19 0.70103931 115 nips-2003-Linear Dependent Dimensionality Reduction
20 0.69942665 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels