nips nips2003 nips2003-66 knowledge-graph by maker-knowledge-mining

66 nips-2003-Extreme Components Analysis


Source: pdf

Author: Max Welling, Christopher Williams, Felix V. Agakov

Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. [sent-8, score-0.122]

2 Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. [sent-9, score-0.141]

3 In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. [sent-10, score-0.58]

4 For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. [sent-13, score-0.71]

5 In general, the solution admits a combination of both. [sent-14, score-0.057]

6 1 Introduction. The simplest and most widely employed technique to reduce the dimensionality of a data distribution is to linearly project it onto the subspace of highest variation (principal components analysis or PCA). [sent-16, score-0.201]

7 For some data distributions however, it is not the directions of large variation that are most distinctive, but the directions of very small variation, i.e. the directions in which the data are (approximately) constrained. [sent-18, score-0.315]

8 In this paper we argue that in reducing the dimensionality of the data, we may want to preserve these constrained directions alongside some of the directions of large variability. [sent-21, score-0.318]

9 The proposed method, termed “extreme components analysis” or XCA, holds the middle ground between PCA and MCA (minor components analysis, the method that projects onto directions of low variability). [sent-22, score-0.383]

10 The objective that determines the optimal combination of principal and minor components derives from the probabilistic formulation of XCA, which neatly generalizes the probabilistic models for PCA and MCA. [sent-23, score-0.543]

11 We propose a very simple and efficient algorithm to extract the optimal combination of principal and minor components and prove some results relating the shape of the log-spectrum to this solution. [sent-25, score-0.481]

12 MCA Consider a plane embedded in 3 dimensions that cuts through the origin. [sent-32, score-0.082]

13 The plane can be described in two ways: $x = Ay$, $y \in \mathbb{R}^2$, or $w^T x = 0$ (1), where A is a 3 × 2 matrix, the columns of which form a basis in the plane, and w is a vector orthogonal to the plane. [sent-35, score-0.081]

14 In the first description we parameterize the modes of variation, while in the second we parameterize the direction of no variation or the direction in which the points are constrained. [sent-36, score-0.107]

15 Note that we only need 3 real parameters to describe a plane in terms of its constraint versus 6 parameters to describe it in terms of its modes of variation. [sent-37, score-0.07]

16 More generally, if we want to describe a d-dimensional subspace in D dimensions we may use D − d constraint directions or d subspace directions. [sent-38, score-0.251]

17 Alternatively, one may describe the data as D − d approximately satisfied constraints, embedded in a high variance background model. [sent-42, score-0.065]

18 The variance of the constrained direction, $1/\|w\|^2$, should be smaller than that of the background model. [sent-44, score-0.084]

19 By multiplying D − d of these “Gaussian pancake” models [6] a probabilistic model for MCA results, with inverse covariance given by $C_{MCA}^{-1} = \frac{I_D}{\sigma_0^2} + W^T W$ (4), where the $w^T$ form the rows of W. [sent-45, score-0.201]
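As a rough numerical illustration of eq. (4), the sketch below (assuming numpy; the dimensions, number of constraints and σ0 are made-up values for the example) builds the PMCA precision from a constraint matrix W and checks that the variance along a constraint direction is pushed below the background variance σ0².

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_constraints, sigma0 = 5, 2, 2.0          # made-up sizes for the illustration

W = rng.standard_normal((n_constraints, D))   # rows w_i^T are the constraint directions
C_inv = np.eye(D) / sigma0**2 + W.T @ W       # eq. (4): inverse covariance of PMCA
C = np.linalg.inv(C_inv)

# Variance along a (unit-norm) constraint direction is below the background
# variance sigma0^2; directions orthogonal to all w_i keep variance sigma0^2.
w_unit = W[0] / np.linalg.norm(W[0])
print(w_unit @ C @ w_unit, "<", sigma0**2)
```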

20 Thus, while PPCA explicitly models the directions of large variability, PMCA explicitly models the directions of small variability. [sent-47, score-0.278]

21 3 Extreme Components Analysis (XCA). Probabilistic PCA can be interpreted as a low variance data cloud which has been stretched out in certain directions. [sent-48, score-0.071]

22 Probabilistic MCA on the other hand can be thought of as a large variance data cloud which has been pushed inward in certain directions. [sent-49, score-0.071]

23 Given the Gaussian assumption, the approximation that we make is due to the fact that we replace the variances in the remaining directions by their average. [sent-50, score-0.166]

24 Intuitively, better approximations may be obtained by identifying the set of eigenvalues which, when averaged, induces the smallest error. [sent-51, score-0.122]

25 The appropriate model, to be discussed below, will both have elongated and contracted directions in its equiprobable contours, resulting in a mix of principal and minor components. [sent-52, score-0.453]

26 The restricting aspect of the PPCA model is that the noise n is added in all directions in input space. [sent-55, score-0.176]

27 Since adding random variables always results in increased variance, the directions modelled by the vectors ai must necessarily have larger variance than the noise directions, resulting in principal components. [sent-56, score-0.368]

28 In order to remove that constraint we need to add the noise only in the directions orthogonal to the ai ’s. [sent-57, score-0.283]

29 This leads to the following “causal generative model” for XCA: $x = Ay + P_A^{\perp} n$, with $n \sim \mathcal{N}[0, \sigma_0^2 I_D]$ and $y \sim \mathcal{N}[0, I_d]$ (5), where $P_A^{\perp} = I_D - A(A^T A)^{-1} A^T$ is the projection operator onto the orthogonal complement of the space spanned by the columns of A. [sent-58, score-0.144]
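A minimal sketch of sampling from the generative process in eq. (5), assuming numpy; A, σ0 and the sample size are arbitrary choices for the illustration. For large N the empirical covariance of the samples should approach σ0² P_A^⊥ + A A^T, the model covariance given in the next sentence.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, sigma0, N = 6, 2, 0.3, 200_000          # made-up dimensions and sample size

A = rng.standard_normal((D, d))
P_A_perp = np.eye(D) - A @ np.linalg.inv(A.T @ A) @ A.T   # projector onto complement of col(A)

y = rng.standard_normal((N, d))               # y ~ N(0, I_d)
n = sigma0 * rng.standard_normal((N, D))      # n ~ N(0, sigma0^2 I_D)
x = y @ A.T + n @ P_A_perp                    # eq. (5): x = A y + P_A^perp n (row-wise)

C_empirical = x.T @ x / N
C_model = sigma0**2 * P_A_perp + A @ A.T      # covariance of the XCA model
print(np.max(np.abs(C_empirical - C_model)))  # shrinks as N grows
```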

30 The covariance of this model is found to be $C_{XCA} = \sigma_0^2 P_A^{\perp} + AA^T$. [sent-59, score-0.083]

31 That is, the components are allowed to model directions of large or small variance. [sent-63, score-0.261]

32 The result is that x has a Gaussian distribution with inverse covariance $C_{XCA}^{-1} = \frac{1}{\sigma_0^2} P_W^{\perp} + W^T W$ (8), where $P_W^{\perp} = I_D - W^T(WW^T)^{-1}W$ is the projection operator onto the orthogonal complement of W. [sent-66, score-0.055]
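The identity between the covariance above and the inverse covariance of eq. (8) can be checked numerically. The sketch below assumes numpy, and it also assumes a particular relation between W and A, namely W = pinv(A); that relation is our choice for the illustration (the intermediate equations are not in this extract), but it makes the identity hold exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, sigma0 = 6, 2, 0.3                      # made-up sizes

A = rng.standard_normal((D, d))
P_A_perp = np.eye(D) - A @ np.linalg.inv(A.T @ A) @ A.T
C = sigma0**2 * P_A_perp + A @ A.T            # model covariance from the previous sentence

# Assumption for this check: W = pinv(A), so that row(W) = col(A) and
# W^T W = A (A^T A)^{-2} A^T; then eq. (8) inverts the covariance above.
W = np.linalg.pinv(A)                         # d x D
P_W_perp = np.eye(D) - W.T @ np.linalg.inv(W @ W.T) @ W
C_inv = P_W_perp / sigma0**2 + W.T @ W        # eq. (8)

print(np.allclose(np.linalg.inv(C), C_inv))   # True
```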

33 2 Maximum Likelihood Solution. For a centered (zero mean) dataset {x} of size N the log-likelihood is given by $L = -\frac{ND}{2}\log(2\pi) + \frac{N}{2}\log\det(WW^T) + \frac{N(D-d)}{2}\log\frac{1}{\sigma_0^2} - \frac{N}{2}\,\mathrm{tr}\!\left[C_{XCA}^{-1} S\right]$ (9), where $S = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T \in \mathbb{R}^{D\times D}$ is the covariance of the data. [sent-76, score-0.252]
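A direct transcription of eq. (9) into code, assuming numpy; the parameterisation (W as a d × D matrix, σ0² a scalar, S the sample covariance) follows the notation above, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def xca_log_likelihood(W, sigma0_sq, S, N):
    """Eq. (9): log-likelihood of N centred samples with sample covariance S
    under the XCA model with parameters W (d x D) and sigma0^2."""
    d, D = W.shape
    P_W_perp = np.eye(D) - W.T @ np.linalg.inv(W @ W.T) @ W
    C_inv = P_W_perp / sigma0_sq + W.T @ W               # eq. (8)
    _, logdet_WWt = np.linalg.slogdet(W @ W.T)
    return (-N * D / 2 * np.log(2 * np.pi)
            + N / 2 * logdet_WWt
            + N * (D - d) / 2 * np.log(1.0 / sigma0_sq)
            - N / 2 * np.trace(C_inv @ S))
```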

34 Next we note that the projections of equation (11) onto the space spanned by W and its orthogonal complement should hold independently. [sent-84, score-0.118]

35 Thus, multiplying equation 11 on the left by either $P_W$ or $P_W^{\perp}$, and multiplying it on the right by $R\Lambda^{-1}$, we obtain the following two equations: $U\Lambda^{-2} = UU^T S U$ (12) and $SU\left(I_d - \Lambda^{-2}/\sigma_0^2\right) = UU^T S U\left(I_d - \Lambda^{-2}/\sigma_0^2\right)$ (13). [sent-85, score-0.102]

36 Combining eqs. (12) and (13) and right-multiplying with $(I_d - \Lambda^{-2}/\sigma_0^2)^{-1}$ we find the eigenvalue equation $SU = U\Lambda^{-2}$. [sent-88, score-0.078]

37 We thus conclude that U is given by eigenvectors of the sample covariance matrix S, while the elements of the (diagonal) matrix Λ are given by $\lambda_i = 1/\sigma_i$, with $\sigma_i^2$ the eigenvalues of S (i.e. the variances of the data along the corresponding eigendirections). [sent-91, score-0.253]

38 Taking the derivative of L with respect to $1/\sigma_0^2$ we find $\sigma_0^2 = \frac{1}{D-d}\,\mathrm{tr}\!\left[P_W^{\perp} S\right] = \frac{1}{D-d}\left(\mathrm{tr}(S) - \mathrm{tr}(U\Lambda^{-2}U^T)\right) = \frac{1}{D-d}\sum_{i\in G}\sigma_i^2$ (15), where G is the set of all eigenvalues of S which are not represented in $\Lambda^{-2}$. [sent-97, score-0.174]

39 The above equation expresses the fact that these eigenvalues are being approximated through their average $\sigma_0^2$. [sent-98, score-0.14]

40 Inserting these solutions back into the log-likelihood (eq. 9) we find $L = -\frac{ND}{2}\log(2\pi e) - \frac{N}{2}\sum_{i\in C}\log(\sigma_i^2) - \frac{N(D-d)}{2}\log\!\left(\frac{1}{D-d}\sum_{i\in G}\sigma_i^2\right)$ (16), where C is the set of retained eigenvalues. [sent-100, score-0.175]
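Eq. (16) depends only on the eigenvalues of S and on which of them are retained, so it can be written as a small helper; a sketch assuming numpy, with the retained set passed in as an index array.

```python
import numpy as np

def reduced_log_likelihood(eigvals, retained_idx, N):
    """Eq. (16): log-likelihood at the ML solution as a function of the
    eigenvalues of S and the indices of the retained components."""
    eigvals = np.asarray(eigvals, dtype=float)
    D, d = eigvals.size, len(retained_idx)
    gap_idx = np.setdiff1d(np.arange(D), retained_idx)
    sigma0_sq = eigvals[gap_idx].mean()                  # eq. (15): average of left-out eigenvalues
    return (-N * D / 2 * np.log(2 * np.pi * np.e)
            - N / 2 * np.sum(np.log(eigvals[retained_idx]))
            - N * (D - d) / 2 * np.log(sigma0_sq))
```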

41 The log-likelihood has now been reduced to a function of the discrete set of eigenvalues $\{\sigma_i^2\}$ of S. [sent-101, score-0.122]

42 (Footnote 2) As we will see later, the left-out eigenvalues have to be contiguous in the spectrum, implying that the matrix $(I_d - \Lambda^{-2}/\sigma_0^2)^{-1}$ can only be singular if there is a retained eigenvalue that is equal to all left-out eigenvalues. [sent-102, score-0.388]

43 3 An Algorithm for XCA. To optimize eq. 16 efficiently we first note that the sum of the eigenvalues $\{\sigma_i^2\}$ is constant: $\sum_{i\in C\cup G}\sigma_i^2 = \mathrm{tr}(S)$. [sent-105, score-0.122]

44 We may use this to rewrite L in terms of the retained eigenvalues only. [sent-106, score-0.27]

45 We define the following auxiliary cost to be minimized, which is proportional to −L up to irrelevant constants: $K = \sum_{i\in C}\log\sigma_i^2 + (D-d)\log\!\left(\mathrm{tr}(S) - \sum_{i\in C}\sigma_i^2\right)$ (17). [sent-107, score-0.085]

46 Next we recall an important result that was proved in [4]: the minimizing solution has eigenvalues $\sigma_i^2$, $i \in G$, which are contiguous in the (ordered) spectrum, i.e. [sent-108, score-0.214]

47 the eigenvalues which are averaged form a “gap” in the spectrum. [sent-110, score-0.122]

48 With this result, the search for the optimal solution has been reduced from exponential to linear in the number of retained dimensions d. [sent-111, score-0.264]

49 Thus we obtain the following algorithm for determining the optimal d extreme components: (1) Compute the first d principal components and the first d minor components, (2) for all d possible positions of the ”gap” compute the cost K in eqn. [sent-112, score-0.556]

50 17, and (3) select the solution that minimizes K. [sent-113, score-0.057]
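A compact sketch of the three-step procedure just described, assuming numpy: eigendecompose S, evaluate the cost K of eq. (17) for every contiguous placement of the gap, and keep the minimiser. The function name and the choice of returning W up to the arbitrary rotation R are ours.

```python
import numpy as np

def xca(S, d):
    """Gap search for XCA: sample covariance S (D x D), d retained components.
    Minimises the cost K of eq. (17) over all contiguous gap placements."""
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # descending order
    D = eigvals.size
    assert 0 < d < D
    trS = eigvals.sum()

    best_K, best_keep = np.inf, None
    for p in range(d + 1):                                # p principal + (d - p) minor components
        keep = np.r_[np.arange(p), np.arange(D - (d - p), D)]
        K = np.sum(np.log(eigvals[keep])) + (D - d) * np.log(trS - eigvals[keep].sum())
        if K < best_K:
            best_K, best_keep = K, keep

    gap = np.setdiff1d(np.arange(D), best_keep)
    sigma0_sq = eigvals[gap].mean()                       # eq. (15)
    U = eigvecs[:, best_keep]
    W = (U @ np.diag(1.0 / np.sqrt(eigvals[best_keep]))).T   # lambda_i = 1/sigma_i, up to a rotation R
    return W, sigma0_sq, best_keep
```

Only the d largest and d smallest eigenvalues and eigenvectors of S are ever retained, which is what step (1) of the description above exploits.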

51 The only difference being that certain constraints forcing the solution to contain only principal or minor components are absent in eqn. [sent-117, score-0.512]

52 For XCA, this opens the possibility for mixed solutions with both principal and minor components. [sent-119, score-0.332]

53 From the above observation we may conclude that the optimal ML solution for XCA will always have a log-likelihood on the training data that is larger than or equal to that of the optimal ML solutions for PPCA and PMCA. [sent-120, score-0.156]

54 Moreover, when XCA contains only principal (or minor) components, it must have the same likelihood on the training data as PPCA (or PMCA). [sent-121, score-0.146]

55 This property suggests using the logarithm of the eigenvalues of S as the natural quantities, since multiplying all eigenvalues by a constant results in a vertical shift of the log-spectrum. [sent-126, score-0.295]

56 Consequently, the properties of the optimal solution only depend on the shape of the log-spectrum. [sent-127, score-0.084]

57 In appendix A we prove the following characterization of the optimal solution (Theorem 1): • A log-linear spectrum has no preference for principal or minor components. [sent-128, score-0.513]

58 • The extreme components of log-convex spectra are principal components. [sent-129, score-0.365]

59 • The extreme components of log-concave spectra are minor components. [sent-130, score-0.481]

60 Although a log-linear spectrum with arbitrary slope has no preference for principal or minor components, the slope does have an impact on the accuracy of the approximation because the variances in the gap are approximated by their average value. [sent-131, score-0.59]

61 A spectrum that can be exactly modelled by PPCA with sufficient retained directions is one which has a pedestal, i.e. one that drops off and then stays constant at some level. [sent-132, score-0.425]

62 Similarly PMCA can model exactly a spectrum which is constant and then drops off while XCA can model exactly a spectrum with a constant section at some arbitrary position. [sent-135, score-0.272]

63 Some interesting examples of spectra can be obtained from the Fourier (spectral) representation of stationary Gaussian processes. [sent-136, score-0.071]

64 Processes with power-law spectra $S(\omega) \propto \omega^{-\alpha}$ are log-convex. [sent-137, score-0.116]

65 An example of a spectrum which is log-linear is obtained from the RBF covariance function. (Table 1: Percent classification error of noisy sinusoids as a function of g = D − d.) [sent-138, score-0.308]

66 The RBF covariance function on the circle will give rise to eigenvalues $\lambda_i \propto e^{-\beta i^2}$, i.e. a log-concave (quadratic) spectrum. [sent-161, score-0.205]
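A small self-contained check of the two spectrum-shape claims of Theorem 1, assuming numpy; the particular power-law and Gaussian-decay spectra below are made up, and the gap cost of eq. (17) is re-evaluated inline.

```python
import numpy as np

def best_num_principal(eigvals, d):
    """Number of principal components p chosen by minimising eq. (17) for a
    spectrum given in descending order; the other d - p components are minor."""
    D, trS = eigvals.size, eigvals.sum()
    costs = []
    for p in range(d + 1):
        keep = np.r_[np.arange(p), np.arange(D - (d - p), D)]
        costs.append(np.sum(np.log(eigvals[keep]))
                     + (D - d) * np.log(trS - eigvals[keep].sum()))
    return int(np.argmin(costs))

i = np.arange(1, 21)
log_convex = i ** -2.0                  # power-law spectrum: log-spectrum is convex
log_concave = np.exp(-0.05 * i ** 2)    # Gaussian decay (RBF on the circle): log-concave

print(best_num_principal(log_convex, d=5))    # expected: 5 (all principal components)
print(best_num_principal(log_concave, d=5))   # expected: 0 (all minor components)
```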

67 Both PCA and MCA share the convenient property that a solution with d components is contained in the solution with d + 1 components. [sent-164, score-0.236]

68 This is not the case for XCA: the solution with d + 1 components may look totally different from the solution with d components (see inset in Figure 1c); in fact, they may not even share a single component! [sent-165, score-0.417]

69 5 Experiments. Small Sample Effects. When the number of data cases is small relative to the dimensionality of the problem, the log-spectrum tends to bend down on the MC side, producing “spurious” minor components in the XCA solution. [sent-166, score-0.377]

70 Minor components that result from finite sample effects, i.e. that do not reflect genuine constraints in the data distribution, can therefore be expected to hurt generalization. [sent-167, score-0.122]

71 This dataset contains 1965 images of size 20 × 28, of which we used 1000 for training and 965 for testing. [sent-171, score-0.054]

72 Note that at the point that minor components appear in the XCA solution (d = 92) the log-likelihood of the training data improves relative to PCA, while the log-likelihood of the test data suffers. [sent-173, score-0.421]

73 Sinusoids in noise. Consider a sum of p sinusoids $Y(t) = \sum_{i=1}^{p} A_i \cos(\omega_i t + \phi_i)$ sampled at D equally-spaced time points. [sent-174, score-0.081]

74 If each $\phi_i$ is random in $(0, 2\pi)$ then the covariance is $\langle Y(t)Y(t')\rangle = \sum_{i=1}^{p} P_i \cos\omega_i(t - t')$, where $P_i = A_i^2/2$. [sent-175, score-0.104]
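A sketch of how such a covariance matrix can be built on a grid of D time points, assuming numpy; the frequencies, powers and noise level are made-up values for the illustration.

```python
import numpy as np

D = 64                                  # number of equally-spaced time points (made up)
t = np.arange(D)
omegas = np.array([0.3, 0.7])           # made-up frequencies omega_i
P = np.array([1.0, 0.5])                # P_i = A_i^2 / 2
noise_var = 0.05

# <Y(t) Y(t')> = sum_i P_i cos(omega_i (t - t')), plus white noise on the diagonal
dt = t[:, None] - t[None, :]
C = sum(P_i * np.cos(w_i * dt) for P_i, w_i in zip(P, omegas)) + noise_var * np.eye(D)

# Each sinusoid contributes a rank-2 term, so without the added white noise the
# covariance is singular (rank 2p); the noise term makes it non-singular.
print(np.linalg.matrix_rank(C - noise_var * np.eye(D)), np.linalg.matrix_rank(C))
```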

75 By adding white noise to this signal we obtain a non-singular covariance matrix. [sent-178, score-0.12]

76 Instead of using the exact covariance matrix for each class, we approximate the covariance matrix using either XCA, PMCA or PPCA. [sent-180, score-0.214]

77 We then compare the accuracy of a classification task using either the exact covariance matrix, or the approximations. [sent-181, score-0.083]

78 (Note that although the covariance can be calculated exactly, the generating process is not in fact a Gaussian process.) [sent-182, score-0.083]

79 For g = 2 the XCA solution for both classes is PPCA, and for g = 6, 7, 8 it is PMCA; in between both classes have true XCA solutions. [sent-203, score-0.057]

80 [Figure 1 axis residue: panels (a)-(c), x-axis "nr. retained dimensions", inset y-axis "eigendirection".] [sent-207, score-0.18]

82 Figure 1: (a) Log-likelihood of the “Frey-faces” training data (top curves) and test data (bottom curves) for PCA (dashed lines) and XCA (solid lines) as a function of the number of components. [sent-210, score-0.207]

83 In Figures 1b and 1c we show the log-likelihood for PCA, MCA and XCA of 335 training cases and 336 test cases respectively. [sent-218, score-0.065]

84 In the inset of Figure 1c we depict the number of PCs and MCs in the XCA solution as we vary the number of retained dimensions. [sent-220, score-0.246]

85 Note the irregular behavior when the number of components is large. [sent-221, score-0.122]

86 In [5] the components were distributed according to a Student-t distribution resulting in a probabilistic model for undercomplete independent components analysis (UICA). [sent-224, score-0.284]

87 Also, can we formulate a Bayesian version of XCA where we predict the number and nature of the components supported by the data? [sent-228, score-0.122]

88 A Proof of Theorem 1. Using the fact that the sum and the product of the eigenvalues are constant, we can rewrite the cost eqn. [sent-233, score-0.16]

89 17 (up to irrelevant constants) in terms of the left-out eigenvalues of the spectrum only. [sent-234, score-0.278]

90 We will also use the fact that the left-out eigenvalues are contiguous in the spectrum. (Footnote 3: The dataset was obtained by M.) [sent-235, score-0.184]

91 These left-out eigenvalues form a “gap” of size $g \stackrel{\mathrm{def}}{=} D - d$, giving $C = g\log\!\left(\sum_{i=i^*}^{i^*+g-1} e^{f_i}\right) - \sum_{i=i^*}^{i^*+g-1} f_i$ (18), where $f_i$ are the log-eigenvalues and $i^*$ is the location of the left hand side of the gap. [sent-237, score-0.254]
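Since eq. (18) and eq. (17) describe the same objective up to a constant (the total sum of log-eigenvalues), the two can be cross-checked numerically; a sketch assuming numpy, with a made-up random spectrum.

```python
import numpy as np

rng = np.random.default_rng(3)
eigvals = np.sort(rng.uniform(0.1, 5.0, size=12))[::-1]   # made-up descending spectrum
f = np.log(eigvals)                                       # log-eigenvalues f_i
D, d = eigvals.size, 7
g = D - d                                                 # gap size

K17, C18 = [], []
for i_star in range(d + 1):                               # left edge of the gap
    gap = np.arange(i_star, i_star + g)
    keep = np.setdiff1d(np.arange(D), gap)
    K17.append(np.sum(np.log(eigvals[keep]))
               + g * np.log(eigvals.sum() - eigvals[keep].sum()))    # eq. (17)
    C18.append(g * np.log(np.sum(np.exp(f[gap]))) - np.sum(f[gap]))  # eq. (18)

# The two costs differ only by the (gap-independent) sum of all log-eigenvalues.
print(np.ptp(np.array(K17) - np.array(C18)))              # ~0
```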

92 This can be expressed as $\delta C = g\log\!\left(1 + \frac{e^{f_{i^*+g}} - e^{f_{i^*}}}{\sum_{i=i^*}^{i^*+g-1} e^{f_i}}\right) - \left(f(i^*+g) - f(i^*)\right)$ (19). [sent-239, score-0.106]

93 Inserting a log-linear spectrum $f_i = b + a \cdot i$ with $a < 0$, and using the result $\sum_{i=0}^{g-1} e^{a\cdot i} = (e^{ag} - 1)/(e^a - 1)$, we find that the change in C vanishes for all log-linear spectra. [sent-240, score-0.061]

94 For the more general case we define corrections $c_i$ to the log-linear spectrum that runs through the points $f_{i^*}$ and $f_{i^*+g}$, i.e. $f_i = b + a\cdot i + c_i$ with $c_{i^*} = c_{i^*+g} = 0$. [sent-242, score-0.191]

95 First consider the case of a convex spectrum between i∗ and i∗ +g, which implies that all ci < 0. [sent-245, score-0.191]

96 Inserting this into 19 we find after some algebra $\delta C = g\log\!\left(1 + \frac{e^{ag} - 1}{\sum_{i'=0}^{g-1} e^{a\cdot i' + c_{i'+i^*}}}\right) - ag$ (20). [sent-246, score-0.082]

97 Because all $c_i < 0$, the first term must be smaller (more negative) than the corresponding term in the linear case, implying that $\delta C < 0$ (the second term is unchanged w.r.t. the linear case). [sent-247, score-0.085]

98 Thus, if the entire spectrum is log-convex the gap will be located on the right, resulting in PCs. [sent-250, score-0.177]

99 A similar argument shows that for log-concave spectra the solutions consist of MCs only. [sent-251, score-0.089]

100 The cost 18 is minimized when some of the $c_i$ are positive and some negative in such a way that $\sum_{i'=0}^{g-1} e^{a\cdot i' + c_{i'+i^*}} \approx \sum_{i'=0}^{g-1} e^{a\cdot i'}$. Note that due to the exponent in this sum, positive $c_i$ have a stronger effect than negative $c_i$. [sent-253, score-0.185]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xca', 0.71), ('ppca', 0.227), ('minor', 0.215), ('mca', 0.205), ('id', 0.193), ('pmca', 0.187), ('directions', 0.139), ('spectrum', 0.136), ('retained', 0.13), ('pca', 0.126), ('components', 0.122), ('eigenvalues', 0.122), ('principal', 0.099), ('mcs', 0.093), ('covariance', 0.083), ('cxca', 0.075), ('extreme', 0.073), ('spectra', 0.071), ('pw', 0.069), ('inserting', 0.065), ('fi', 0.061), ('inset', 0.059), ('solution', 0.057), ('pcs', 0.056), ('ci', 0.055), ('orthogonal', 0.055), ('su', 0.053), ('tr', 0.052), ('multiplying', 0.051), ('dimensions', 0.05), ('ea', 0.049), ('log', 0.045), ('sinusoids', 0.044), ('ay', 0.044), ('gap', 0.041), ('wt', 0.041), ('variance', 0.041), ('rt', 0.04), ('probabilistic', 0.04), ('noise', 0.037), ('det', 0.037), ('aat', 0.037), ('eag', 0.037), ('efi', 0.037), ('eigendirection', 0.037), ('variation', 0.037), ('ml', 0.036), ('rd', 0.035), ('contiguous', 0.035), ('wi', 0.035), ('welling', 0.034), ('complement', 0.033), ('poe', 0.032), ('ai', 0.032), ('plane', 0.032), ('spanned', 0.03), ('implying', 0.03), ('cloud', 0.03), ('williams', 0.028), ('rows', 0.027), ('eigenvalue', 0.027), ('optimal', 0.027), ('variances', 0.027), ('dataset', 0.027), ('training', 0.027), ('columns', 0.026), ('sw', 0.026), ('def', 0.026), ('parameterize', 0.026), ('matrix', 0.024), ('background', 0.024), ('dash', 0.024), ('edinburgh', 0.024), ('powers', 0.022), ('subspace', 0.021), ('dimensionality', 0.021), ('hinton', 0.021), ('cos', 0.021), ('frontal', 0.021), ('pa', 0.021), ('cost', 0.02), ('singular', 0.02), ('likelihood', 0.02), ('constraint', 0.02), ('gaussian', 0.02), ('modelled', 0.02), ('irrelevant', 0.02), ('orthonormal', 0.02), ('constraints', 0.019), ('constrained', 0.019), ('cases', 0.019), ('prove', 0.018), ('solutions', 0.018), ('slope', 0.018), ('isotropic', 0.018), ('approximated', 0.018), ('products', 0.018), ('rewrite', 0.018), ('preference', 0.018), ('modes', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 66 nips-2003-Extreme Components Analysis

Author: Max Welling, Christopher Williams, Felix V. Agakov

Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

2 0.21698602 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

Author: David Hoyle, Magnus Rattray

Abstract: We derive the limiting form of the eigenvalue spectrum for sample covariance matrices produced from non-isotropic data. For the analysis of standard PCA we study the case where the data has increased variance along a small number of symmetry-breaking directions. The spectrum depends on the strength of the symmetry-breaking signals and on a parameter α which is the ratio of sample size to data dimension. Results are derived in the limit of large data dimension while keeping α fixed. As α increases there are transitions in which delta functions emerge from the upper end of the bulk spectrum, corresponding to the symmetry-breaking directions in the data, and we calculate the bias in the corresponding eigenvalues. For kernel PCA the covariance matrix in feature space may contain symmetry-breaking structure even when the data components are independently distributed with equal variance. We show examples of phase-transition behaviour analogous to the PCA results in this case. 1

3 0.080855966 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

Author: Neil D. Lawrence

Abstract: In this paper we introduce a new underlying probabilistic model for principal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior’s covariance function constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance functions which allow non-linear mappings. This more general Gaussian process latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to ‘twin kernel PCA’ in which a mapping between feature spaces occurs.

4 0.076211572 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) nonGaussian additive noise, and to unbiased non-additive models. 1

5 0.062764607 128 nips-2003-Minimax Embeddings

Author: Matthew Brand

Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1

6 0.059821036 92 nips-2003-Information Bottleneck for Gaussian Variables

7 0.058890618 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

8 0.05815709 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

9 0.057883941 107 nips-2003-Learning Spectral Clustering

10 0.055648781 112 nips-2003-Learning to Find Pre-Images

11 0.052221354 87 nips-2003-Identifying Structure across Pre-partitioned Data

12 0.046750061 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

13 0.045142017 88 nips-2003-Image Reconstruction by Linear Programming

14 0.044710856 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

15 0.044262491 81 nips-2003-Geometric Analysis of Constrained Curves

16 0.041972473 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

17 0.040545028 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

18 0.040272903 120 nips-2003-Locality Preserving Projections

19 0.039792992 148 nips-2003-Online Passive-Aggressive Algorithms

20 0.03909716 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.153), (1, -0.053), (2, -0.003), (3, 0.003), (4, -0.026), (5, 0.134), (6, 0.061), (7, -0.018), (8, 0.054), (9, -0.082), (10, -0.047), (11, -0.052), (12, 0.014), (13, -0.036), (14, -0.09), (15, -0.051), (16, 0.094), (17, -0.009), (18, 0.149), (19, -0.004), (20, 0.049), (21, -0.018), (22, 0.041), (23, -0.063), (24, 0.064), (25, 0.114), (26, -0.018), (27, -0.045), (28, -0.14), (29, -0.151), (30, 0.096), (31, -0.044), (32, -0.105), (33, 0.041), (34, -0.022), (35, -0.079), (36, -0.092), (37, 0.089), (38, -0.086), (39, 0.084), (40, -0.115), (41, -0.046), (42, 0.153), (43, 0.097), (44, -0.033), (45, 0.034), (46, 0.15), (47, 0.097), (48, -0.259), (49, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94639653 66 nips-2003-Extreme Components Analysis

Author: Max Welling, Christopher Williams, Felix V. Agakov

Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

2 0.84489101 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

Author: David Hoyle, Magnus Rattray

Abstract: We derive the limiting form of the eigenvalue spectrum for sample covariance matrices produced from non-isotropic data. For the analysis of standard PCA we study the case where the data has increased variance along a small number of symmetry-breaking directions. The spectrum depends on the strength of the symmetry-breaking signals and on a parameter α which is the ratio of sample size to data dimension. Results are derived in the limit of large data dimension while keeping α fixed. As α increases there are transitions in which delta functions emerge from the upper end of the bulk spectrum, corresponding to the symmetry-breaking directions in the data, and we calculate the bias in the corresponding eigenvalues. For kernel PCA the covariance matrix in feature space may contain symmetry-breaking structure even when the data components are independently distributed with equal variance. We show examples of phase-transition behaviour analogous to the PCA results in this case. 1

3 0.49812829 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

Author: Haifeng Li, Tao Jiang, Keshu Zhang

Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1

4 0.49800292 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

Author: Neil D. Lawrence

Abstract: In this paper we introduce a new underlying probabilistic model for principal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior’s covariance function constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance functions which allow non-linear mappings. This more general Gaussian process latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to ‘twin kernel PCA’ in which a mapping between feature spaces occurs.

5 0.33785075 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

Author: David Barber, Felix V. Agakov

Abstract: The maximisation of information transmission over noisy channels is a common, albeit generally computationally difficult problem. We approach the difficulty of computing the mutual information for noisy channels by using a variational approximation. The resulting IM algorithm is analagous to the EM algorithm, yet maximises mutual information, as opposed to likelihood. We apply the method to several practical examples, including linear compression, population encoding and CDMA. 1

6 0.32486758 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

7 0.31299272 115 nips-2003-Linear Dependent Dimensionality Reduction

8 0.30838731 92 nips-2003-Information Bottleneck for Gaussian Variables

9 0.29192537 112 nips-2003-Learning to Find Pre-Images

10 0.28550157 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel

11 0.28162625 107 nips-2003-Learning Spectral Clustering

12 0.27698487 196 nips-2003-Wormholes Improve Contrastive Divergence

13 0.27615631 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

14 0.27068797 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

15 0.26941463 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

16 0.25851151 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

17 0.25679263 157 nips-2003-Plasticity Kernels and Temporal Statistics

18 0.23938189 190 nips-2003-Unsupervised Color Decomposition Of Histologically Stained Tissue Samples

19 0.23880227 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

20 0.22462066 81 nips-2003-Geometric Analysis of Constrained Curves


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.05), (11, 0.027), (29, 0.052), (30, 0.025), (35, 0.052), (53, 0.142), (56, 0.251), (69, 0.021), (71, 0.062), (76, 0.047), (85, 0.071), (91, 0.095)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94360518 10 nips-2003-A Low-Power Analog VLSI Visual Collision Detector

Author: Reid R. Harrison

Abstract: We have designed and tested a single-chip analog VLSI sensor that detects imminent collisions by measuring radially expansive optic flow. The design of the chip is based on a model proposed to explain leg-extension behavior in flies during landing approaches. A new elementary motion detector (EMD) circuit was developed to measure optic flow. This EMD circuit models the bandpass nature of large monopolar cells (LMCs) immediately postsynaptic to photoreceptors in the fly visual system. A 16 × 16 array of 2-D motion detectors was fabricated on a 2.24 mm × 2.24 mm die in a standard 0.5-µm CMOS process. The chip consumes 140 µW of power from a 5 V supply. With the addition of wide-angle optics, the sensor is able to detect collisions around 500 ms before impact in complex, real-world scenes.
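
A software analogue of what an EMD computes (an illustrative sketch only; the chip additionally band-passes its inputs in an LMC-like fashion and realizes everything in analog CMOS): a delay-and-correlate Reichardt-style detector pair applied to a drifting 1-D grating, giving opposite-signed mean outputs for the two motion directions.

```python
import numpy as np

def lowpass(sig, tau, dt):
    """First-order low-pass filter used as the EMD delay element."""
    out = np.zeros_like(sig)
    a = dt / (tau + dt)
    for t in range(1, len(sig)):
        out[t] = out[t - 1] + a * (sig[t] - out[t - 1])
    return out

def emd_response(i1, i2, tau=0.05, dt=0.001):
    """Reichardt-style correlator: positive for motion from input 1 towards input 2."""
    return lowpass(i1, tau, dt) * i2 - lowpass(i2, tau, dt) * i1

# Two photoreceptors sampling a drifting sinusoidal grating
dt, T = 0.001, 2.0
t = np.arange(0, T, dt)
phase_offset = 0.5                       # spatial separation of the two inputs (rad)
omega = 2 * np.pi * 1.0                  # temporal angular frequency of the drift (rad/s)
i1 = np.sin(omega * t)
i2 = np.sin(omega * t - phase_offset)    # input 2 lags input 1: motion from 1 to 2

print("mean EMD output (motion 1 -> 2):", emd_response(i1, i2).mean())
print("mean EMD output (motion 2 -> 1):", emd_response(i2, i1).mean())
```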

same-paper 2 0.81525338 66 nips-2003-Extreme Components Analysis

Author: Max Welling, Christopher Williams, Felix V. Agakov

Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

3 0.63506907 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

Author: Thomas Natschläger, Wolfgang Maass

Abstract: We employ an efficient method using Bayesian and linear classifiers for analyzing the dynamics of information in high-dimensional states of generic cortical microcircuit models. It is shown that such recurrent circuits of spiking neurons have an inherent capability to carry out rapid computations on complex spike patterns, merging information contained in the order of spike arrival with previously acquired context information. 1
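
A minimal sketch of this style of readout analysis (an assumed stand-in setup, not the authors' circuit model or data): train a linear classifier on snapshots of a high-dimensional circuit state to ask how much stimulus information the state carries at a given time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_units = 200, 100
labels = rng.integers(0, 2, n_trials)             # which input pattern was presented
# Stand-in "circuit states": noisy high-dimensional responses with a weak label signal
states = rng.standard_normal((n_trials, n_units)) + 0.3 * labels[:, None]

readout = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(readout, states, labels, cv=5).mean()
print(f"decoding accuracy of the linear readout: {accuracy:.2f}")
```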

4 0.63483608 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method that works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low-dimensional coordinates are observed, each lying on a different high-dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima. Experimental results illustrate the viability of the approach as a non-linear extension of CCA.
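
For reference, a minimal sketch of ordinary linear CCA, the base case the paper extends (the paper's contribution, aligning mixtures of local linear models to obtain a non-linear CCA, is not reproduced here): find directions in two views whose projections are maximally correlated via an SVD of the whitened cross-covariance.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def linear_cca(X, Y, k, reg=1e-6):
    """X: (n, dx), Y: (n, dy); returns projections Wx, Wy and the top-k canonical correlations."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]

# Usage sketch: Wx, Wy, corrs = linear_cca(X_view1, X_view2, 2)
```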

5 0.63222349 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

Author: Amos J. Storkey

Abstract: Discrete Fourier transforms and related Fourier methods are practical to compute thanks to the fast Fourier transform (FFT). However, there are many situations where computing Fourier transforms from incomplete data would be desirable. In this paper it is recognised that formulating the FFT algorithm as a belief network allows suitable priors to be set for the Fourier coefficients. Furthermore, efficient generalised belief propagation methods between clusters of four nodes enable the Fourier coefficients to be inferred and the missing data to be estimated in close to O(n log n) time, where n is the total number of given and missing data points. This method is compared with a number of common approaches, such as setting the missing data to zero or interpolating it. It is tested on generated data and on a Fourier analysis of a damaged audio signal.
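
An illustrative sketch of the underlying idea (not the paper's generalised belief propagation, which achieves near O(n log n) cost): put a Gaussian prior on the Fourier coefficients, here with variance decaying at high frequency, i.e. a smoothness prior, and compute the MAP signal given only the observed samples. This dense linear-algebra version costs O(n^3) and is only meant to show what is being inferred; the signal, mask, and prior shape are arbitrary choices.

```python
import numpy as np

n = 128
t = np.arange(n)
signal = np.sin(2 * np.pi * 3 * t / n) + 0.5 * np.sin(2 * np.pi * 7 * t / n)

rng = np.random.default_rng(1)
observed = rng.random(n) < 0.6                        # keep roughly 60% of the samples
y = signal[observed]

F = np.fft.fft(np.eye(n)) / np.sqrt(n)                # unitary DFT matrix
cycles = np.abs(np.fft.fftfreq(n)) * n                # cycle index per frequency bin
prior_var = 1.0 / (1.0 + (cycles / 8.0) ** 2)         # smoothness prior on coefficients
obs_var = 1e-2                                        # small observation variance (jitter)

P = np.eye(n)[observed]                               # selects the observed samples
prior_prec = (F.conj().T * (1.0 / prior_var)) @ F     # F^H diag(1/v) F
A = P.T @ P / obs_var + prior_prec
b = P.T @ y / obs_var
x_map = np.real(np.linalg.solve(A, b))                # MAP reconstruction of the signal

x_zero = np.where(observed, signal, 0.0)              # naive zero-filling baseline
print("RMSE zero-fill :", np.sqrt(np.mean((x_zero - signal) ** 2)))
print("RMSE MAP prior :", np.sqrt(np.mean((x_map - signal) ** 2)))
```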

6 0.62871182 113 nips-2003-Learning with Local and Global Consistency

7 0.62770987 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

8 0.6268816 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

9 0.62673521 126 nips-2003-Measure Based Regularization

10 0.62654853 78 nips-2003-Gaussian Processes in Reinforcement Learning

11 0.62574226 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

12 0.62500936 107 nips-2003-Learning Spectral Clustering

13 0.62394416 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing

14 0.62242633 81 nips-2003-Geometric Analysis of Constrained Curves

15 0.6220386 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

16 0.62098807 115 nips-2003-Linear Dependent Dimensionality Reduction

17 0.62067318 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

18 0.62047422 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

19 0.61997652 73 nips-2003-Feature Selection in Clustering Problems

20 0.6197902 79 nips-2003-Gene Expression Clustering with Functional Mixture Models