nips nips2005 nips2005-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Manfred Opper
Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. [sent-4, score-0.461]
2 Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. [sent-5, score-0.245]
3 A perturbative correction to the result is computed and an alternative simplified derivation is also presented. [sent-6, score-0.263]
4 With the somewhat unorthodox application of approximate inference to resampling in PCA, I hope to be able to stress the following points: • Efficient approximate inference techniques can be useful in areas of Machine Learning where one would not necessarily assume that they are applicable. [sent-11, score-0.743]
5 • Approximate inference methods can be highly robust allowing for analytic continuations of model parameters to the complex plane or even noninteger dimensions. [sent-13, score-0.104]
6 • It is not always necessary to use a large number of variational parameters in order to get reasonable accuracy. [sent-14, score-0.06]
7 • Inference methods could be systematically improved using perturbative corrections. [sent-15, score-0.112]
8 The goal is to project data vectors y from a typically high (d-) dimensional space into an optimally chosen lower (q-) dimensional linear space with q ≪ d, thereby minimizing the expected projection error $\varepsilon = E\,\|y - P_q[y]\|^2$, where $P_q[y]$ denotes the projection. [sent-18, score-0.178]
9 E stands for an expectation over the distribution of the data. [sent-19, score-0.053]
10 In practice, where the distribution is not available, one has to work with a data sample D0 consisting of N vectors $y_k = (y_k(1), y_k(2), \ldots, y_k(d))^T$. [sent-20, score-0.126]
11 We arrange these vectors into a (d × N) data matrix $Y = (y_1, y_2, \ldots, y_N)$. [sent-27, score-0.054]
12 Assuming centered data, the optimal subspace is spanned by the eigenvectors $u_l$ of the d × d data covariance matrix $C = \frac{1}{N} Y Y^T$ corresponding to the q largest eigenvalues $\lambda_k$. [sent-31, score-0.422]
13 We will assume that these correspond to all eigenvectors with eigenvalues $\lambda_k > \lambda$ above some threshold value λ. [sent-32, score-0.037]
14 After computing the PCA projection, one would be interested in finding out whether the computed subspace represents the data well by estimating the average projection error on novel data y (i.e. not contained in D0) which are drawn from the same distribution. [sent-33, score-0.194]
15 Fixing the projection $P_q$, the error can be rewritten as $E = \sum_{\lambda_l < \lambda} E\,\mathrm{Tr}\!\left[ y y^T u_l u_l^T \right]$ (1), where the expectation is only over y and the training data are fixed. [sent-34, score-0.255]
16 The training error $E_t = \sum_{\lambda_l < \lambda} \lambda_l$ can be obtained without knowledge of the distribution but will usually only give an optimistically biased estimate for E. [sent-35, score-0.093]
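To make these two quantities concrete, here is a small sketch (not from the paper) that computes the training error $E_t$ and a held-out estimate of the projection error (1) for a given eigenvalue threshold; the function name, the assumption of already-centered data and the availability of a separate test set are illustrative choices.

```python
import numpy as np

def pca_errors(Y_train, Y_test, lam_threshold):
    """Training error E_t and held-out estimate of the projection error (1).

    Y_train, Y_test: (d x N) matrices of (already centered) column data vectors.
    """
    d, N = Y_train.shape
    C = Y_train @ Y_train.T / N                  # data covariance C = (1/N) Y Y^T
    lam, U = np.linalg.eigh(C)                   # eigenvalues / eigenvectors of C
    U_low = U[:, lam < lam_threshold]            # discarded directions (lambda_l < lambda)
    E_t = lam[lam < lam_threshold].sum()         # training error: sum of discarded eigenvalues
    # held-out estimate of (1): mean squared norm of the discarded components
    E_test = np.mean(np.sum((U_low.T @ Y_test) ** 2, axis=0))
    return E_t, E_test
```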
17 A resampling estimate for the error. New artificial data samples D of arbitrary size can be created by resampling a number of data points from D0, with or without replacement. [sent-37, score-1.141]
18 Thus, some yi in D0 may appear multiple times in D and others not at all. [sent-39, score-0.033]
19 The idea of performing PCA on resampled data sets D and testing on the remaining data D0 \ D motivates the following definition of a resample-averaged reconstruction error $E_r = \frac{1}{N_0}\, E_D\!\left[ \sum_{y_i \notin D;\ \lambda_l < \lambda} \mathrm{Tr}\, y_i y_i^T u_l u_l^T \right]$ (2) as a proxy for E. [sent-40, score-0.658]
20 N0 is the expected number of data in D0 which are not contained in the random set D. [sent-43, score-0.036]
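For reference, the brute-force Monte-Carlo version of (2), which the EC approximation is meant to replace, can be sketched as follows. It assumes resampling µN points with replacement and normalizes by the expected number of left-out points $N_0 = N(1 - 1/N)^{\mu N}$; the function name and defaults are illustrative.

```python
import numpy as np

def resampled_error(Y, lam_threshold, mu=1.0, n_resamples=200, seed=0):
    """Monte-Carlo estimate of the resample-averaged reconstruction error E_r of (2)."""
    rng = np.random.default_rng(seed)
    d, N = Y.shape
    m = int(round(mu * N))                      # number of resampled points
    N0 = N * (1.0 - 1.0 / N) ** m               # expected number of left-out points
    acc = 0.0
    for _ in range(n_resamples):
        idx = rng.integers(0, N, size=m)        # resample with replacement
        counts = np.bincount(idx, minlength=N)  # occupation numbers s_i
        held_out = np.flatnonzero(counts == 0)  # y_i not contained in D
        C = Y[:, idx] @ Y[:, idx].T / m         # covariance of the resampled data
        lam, U = np.linalg.eigh(C)
        U_low = U[:, lam < lam_threshold]       # eigenvectors below the threshold
        acc += np.sum((U_low.T @ Y[:, held_out]) ** 2)
    return acc / (n_resamples * N0)

# usage with a centered (d x N) data matrix Y:
# print(resampled_error(Y, lam_threshold=1.0, mu=3.0))
```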
21 2 Basic formalism. We introduce "occupation numbers" $s_i$ which count how many times $y_i$ is contained in D. [sent-46, score-0.079]
22 D is a diagonal random matrix with $D_{ii} = D_i = \frac{1}{\mu\Gamma}\,(s_i + \epsilon\,\delta_{s_i,0})$ and $C(\epsilon) = \frac{\Gamma}{N}\, Y D Y^T$ (3). [sent-48, score-0.081]
23 N (3) C(0) is proportional to the covariance matrix of the resampled data. [sent-49, score-0.257]
24 $\mu N = E_D[\sum_i s_i]$ is the expected number of data in D (counting multiplicities). [sent-52, score-0.046]
25 Using $\epsilon$, we can generate expressions that can be used in (2) to sum over the data which are not contained in the set D: $C'(0) = \frac{1}{\mu N} \sum_j \delta_{s_j,0}\, y_j y_j^T$ (4). [sent-54, score-0.232]
26 In the following, $\lambda_k$ and $u_k$ will always denote eigenvalues and eigenvectors of the data-dependent (i.e. resampled) covariance. [sent-55, score-0.223]
27 The matrix Green's function has the spectral representation $G(\Gamma - i\eta) = \sum_k \frac{u_k u_k^T}{\lambda_k + \Gamma - i\eta}$ (6). Hence, we have $E_r = E_r^0 + \int_{0^+}^{\lambda} d\lambda'\, \varepsilon_r(\lambda')$ (7), where $\varepsilon_r(\lambda) = \frac{1}{\pi} \lim_{\eta\to 0^+} \mathrm{Im}\, \frac{1}{N_0}\, E_D\!\left[\sum_j \delta_{s_j,0}\, \mathrm{Tr}\, y_j y_j^T\, G(-\lambda - i\eta)\right]$ (8) defines the error density from all eigenvalues > 0 and $E_r^0$ is the contribution from the eigenspace with $\lambda_k = 0$. [sent-59, score-0.549]
28 This is a well known construction in statistical physics, and has also been used within the NIPS community to study the distribution of eigenvalues for an average case analysis of PCA [5]. [sent-61, score-0.148]
29 Its use for computing the expected reconstruction error is to my knowledge new. [sent-62, score-0.184]
30 With the (N × N) kernel matrix $K = \frac{1}{N} Y^T Y$ we define the Gaussian partition function $Z = \int dx\, \exp\!\big(-\tfrac{1}{2}\, x^T (K^{-1} + D)\, x\big)$ (11) $= |K|^{1/2}\, \Gamma^{d/2}\, (2\pi)^{(N-d)/2} \int d^d z\, \exp\!\big(-\tfrac{1}{2}\, z^T (C(\epsilon) + \Gamma I)\, z\big)$ (12). [sent-63, score-0.365]
31 x is an N-dimensional integration variable. [sent-64, score-0.036]
32 The equality can be easily shown by expressing the integrals as determinants. [sent-65, score-0.097]
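As a numerical sanity check of the identity (11) = (12), the script below compares the logarithms of the two sides for a random instance. The choices d > N (so that K is invertible), Poisson occupation numbers and the form $D_i = (s_i + \epsilon\,\delta_{s_i,0})/(\mu\Gamma)$ from the reconstruction of (3) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, Gamma, mu, eps = 30, 20, 0.7, 3.0, 0.2      # d > N so that K = Y^T Y / N is invertible

Y = rng.standard_normal((d, N))                   # (d x N) data matrix
s = rng.poisson(mu, size=N)                       # occupation numbers s_i
D = np.diag((s + eps * (s == 0)) / (mu * Gamma))  # D_i = (s_i + eps*delta_{s_i,0}) / (mu*Gamma)

K = Y.T @ Y / N                                   # (N x N) kernel matrix
C = (Gamma / N) * Y @ D @ Y.T                     # C(eps) = (Gamma/N) Y D Y^T

# log of the Gaussian integral (11) and of the right-hand side (12), via log-determinants
lhs = 0.5 * N * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(np.linalg.inv(K) + D)[1]
rhs = (0.5 * np.linalg.slogdet(K)[1] + 0.5 * d * np.log(Gamma) + 0.5 * N * np.log(2 * np.pi)
       - 0.5 * np.linalg.slogdet(C + Gamma * np.eye(d))[1])
print(lhs, rhs)                                   # the two values agree to numerical precision
```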
33 The first representation (11) is useful for computing the resampling average, and the second one connects directly to the definition of the matrix Green's function G. [sent-66, score-0.592]
34 Note that, by its dependence on the kernel matrix K, a generalization to d = ∞ dimensional feature spaces and kernel PCA is straightforward. [sent-67, score-0.09]
35 The partition function can then be understood as a certain Gaussian process expectation. [sent-68, score-0.088]
36 The free energy F = −ln Z enables us to generate the following quantities: $-2\,\frac{\partial \ln Z}{\partial \epsilon}\Big|_{\epsilon=0} = \frac{1}{\mu N}\sum_{j=1}^{N} \delta_{s_j,0}\, \mathrm{Tr}\!\left[\, y_j y_j^T\, G(\Gamma)\right]$ (13) and $-2\,\frac{\partial \ln Z}{\partial \Gamma} = -\frac{d}{\Gamma} + \mathrm{Tr}\, G(\Gamma)$ (14), where we have used (4) for (13). [sent-70, score-1.312]
37 (13) will be used for the computation of (8) and (14) applies to the density of eigenvalues. [sent-71, score-0.061]
38 Note that the definition of the partition function Z requires that Γ > 0, whereas the application to the reconstruction error (7) needs negative values Γ = −λ < 0. [sent-72, score-0.306]
39 Hence, an analytic continuation of the end results must be performed. [sent-73, score-0.087]
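For Γ > 0, where Z exists, the identities (13) and (14) can also be verified numerically by finite differences of the closed form (12); a self-contained sketch, under the same assumptions as above (Poisson occupation numbers, $D_i$ as in (3)), is:

```python
import numpy as np

def log_Z(Y, s, Gamma, mu, eps):
    """ln Z from the d-dimensional representation (12), valid for Gamma > 0."""
    d, N = Y.shape
    C = (Gamma / N) * Y @ np.diag((s + eps * (s == 0)) / (mu * Gamma)) @ Y.T
    K = Y.T @ Y / N
    return (0.5 * np.linalg.slogdet(K)[1] + 0.5 * d * np.log(Gamma)
            + 0.5 * N * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(C + Gamma * np.eye(d))[1])

rng = np.random.default_rng(1)
d, N, Gamma, mu = 30, 20, 0.7, 3.0
Y = rng.standard_normal((d, N))
s = rng.poisson(mu, size=N)
G = np.linalg.inv((1.0 / (mu * N)) * Y @ np.diag(s) @ Y.T + Gamma * np.eye(d))  # G = (C(0) + Gamma I)^(-1)

h = 1e-6
# (13): -2 d(ln Z)/d(eps) at eps = 0  versus  (1/(mu N)) sum_j delta_{s_j,0} Tr[y_j y_j^T G]
lhs13 = -2 * (log_Z(Y, s, Gamma, mu, h) - log_Z(Y, s, Gamma, mu, -h)) / (2 * h)
rhs13 = sum((s[j] == 0) * (Y[:, j] @ G @ Y[:, j]) for j in range(N)) / (mu * N)
# (14): -2 d(ln Z)/d(Gamma)  versus  -d/Gamma + Tr G  (at eps = 0)
lhs14 = -2 * (log_Z(Y, s, Gamma + h, mu, 0.0) - log_Z(Y, s, Gamma - h, mu, 0.0)) / (2 * h)
rhs14 = -d / Gamma + np.trace(G)
print(lhs13, rhs13)
print(lhs14, rhs14)
```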
40 4 Resampling average and replicas. Equations (13) and (14) show that we can compute the desired resampling averages from the expected free energy $-E_D[\ln Z]$. [sent-74, score-0.728]
41 This can be expressed using the "replica trick" of statistical physics (see e.g. [sent-75, score-0.058]
42 [6]), using $E_D[\ln Z] = \lim_{n\to 0} \frac{1}{n}\, \ln E_D[Z^n]$ (15), where one attempts an approximate computation of $E_D[Z^n]$ for integer n and uses a continuation to real numbers at the end. [sent-77, score-0.535]
43 The n times replicated and averaged partition function (11) can be written in the form $Z^{(n)} \equiv E_D[Z^n] = \int dx\, \psi_1(x)\, \psi_2(x)$ (16). [sent-78, score-0.123]
44 Here $x = (x_1, \ldots, x_n)$, and $\psi_1(x) = E_D\!\left[\exp\!\big(-\tfrac{1}{2}\sum_{a=1}^n x_a^T D\, x_a\big)\right]$ and $\psi_2(x) = \exp\!\big(-\tfrac{1}{2}\sum_{a=1}^n x_a^T K^{-1} x_a\big)$ (17). The unaveraged partition function Z (11) is Gaussian, but the averaged $Z^{(n)}$ is not and is usually intractable. [sent-83, score-0.179]
45 5 Approximate inference. To approximate $Z^{(n)}$, we will use the EC approximation recently introduced by Opper & Winther [1]. [sent-84, score-0.198]
46 $p_1$ tries to mimic the intractable $p(x) \propto \psi_1(x)\, \psi_2(x)$, replacing the multivariate Gaussian $\psi_2$ by a simpler, i.e. spherical, Gaussian term. [sent-86, score-0.064]
47 This additive renormalization of the free energy − ln Z will not influence the subsequent computations. [sent-89, score-0.472]
48 One may think of using a general diagonal matrix $\Lambda_1$, but we will restrict ourselves here to the simplest case of a spherical Gaussian with a single parameter $\Lambda_1$. [sent-91, score-0.142]
49 The strategy is to split $Z^{(n)}$ into a product of $Z_1$ and a term that has to be further approximated: $Z^{(n)} = Z_1 \int dx\, p_1(x)\, \psi_2(x)\, e^{\Lambda_1 x^T x} \approx Z_1 \int dx\, p_0(x)\, \psi_2(x)\, e^{\Lambda_1 x^T x} \equiv Z_{EC}^{(n)}(\Lambda_1, \Lambda_0)$ (19). [sent-92, score-0.446]
50 The approximation replaces the intractable average over p1 by a tractable one over p0 . [sent-93, score-0.099]
51 To optimize $\Lambda_1$ and $\Lambda_0$ we argue as follows: we try to make $p_0$ as close as possible to $p_1$ by matching the moments $\langle x^T x\rangle_1 = \langle x^T x\rangle_0$. [sent-94, score-0.034]
52 Second, since the true partition function Z (n) is independent of Λ1 , we expect that a good approximation to Z (n) should be stationary with respect to variations of Λ1 . [sent-97, score-0.189]
53 Both conditions can be expressed by the requirement that $\ln Z_{EC}^{(n)}(\Lambda_1, \Lambda_0)$ must be stationary with respect to variations of $\Lambda_1$ and $\Lambda_0$. [sent-98, score-0.404]
54 Within this EC approximation we can carry out the replica limit $E_D[\ln Z] \approx \ln Z_{EC} = \lim_{n\to 0}\frac{1}{n}\, \ln Z_{EC}^{(n)}$ and get after some calculations $-\ln Z_{EC} = -E_D\!\left[\ln \int dx\, e^{-\frac{1}{2} x^T (D + (\Lambda_0-\Lambda)I)\, x}\right] - \ln \int dx\, e^{-\frac{1}{2} x^T (K^{-1}+\Lambda I)\, x} + \ln \int dx\, e^{-\frac{1}{2}\Lambda_0\, x^T x}$ (20), where we have set $\Lambda = \Lambda_0 - \Lambda_1$. [sent-99, score-2.87]
55 Since the first Gaussian integral factorises, we can now perform the resampling average in (20) relatively easily for the case when all $s_j$'s in s (3) are independent. [sent-100, score-0.633]
56 This corresponds to a Poisson approximation for the case of resampling µN points with replacement. [sent-104, score-0.599]
57 The variational equations which make (20) stationary are $E_D\!\left[\frac{1}{\Lambda_0 - \Lambda + D_i}\right] = \frac{1}{\Lambda_0}$ and $\frac{1}{N}\sum_k \frac{\omega_k}{1 + \omega_k \Lambda} = \frac{1}{\Lambda_0}$ (21), where $\omega_k$ are the eigenvalues of the matrix K. [sent-105, score-0.302]
58 The variational equations have to be solved in the region Γ = −λ < 0 where the original partition function does not exist. [sent-106, score-0.174]
59 6 Experiments. By eliminating the parameter $\Lambda_0$ from (21), it is possible to reduce the numerical computations to solving a nonlinear equation for a single complex parameter $\Lambda$, which can be solved easily and quickly by a Newton method. [sent-108, score-0.026]
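A possible sketch of this reduction is given below: $\Lambda_0$ is eliminated via the second equation in (21), the resampling average is taken over Poisson-distributed occupation numbers with $D_i = s_i/(\mu\Gamma)$ (i.e. ε = 0), and the resulting scalar equation is solved by a complex Newton iteration. The function name, the truncation of the Poisson sum, the numerical derivative and the starting point are assumptions of this sketch rather than prescriptions from the paper; recovering the error density (8) or the eigenvalue density from the solution additionally requires the relations (13) and (14).

```python
import numpy as np

def ec_lambda(omega, lam, mu, eta=1e-3, s_max=80, newton_steps=100, tol=1e-12):
    """Solve the single-parameter EC fixed point derived from (21) at Gamma = -lam - i*eta.

    omega: the N eigenvalues of the kernel matrix K = Y^T Y / N.
    Assumes Poisson occupation numbers s_i and D_i = s_i / (mu * Gamma), i.e. eps = 0.
    """
    Gamma = -lam - 1j * eta
    s = np.arange(s_max)
    w = np.empty(s_max)                       # Poisson weights exp(-mu) mu^s / s!
    w[0] = np.exp(-mu)
    for k in range(1, s_max):
        w[k] = w[k - 1] * mu / k
    D_vals = s / (mu * Gamma)

    def g(L):                                 # 1/Lambda_0 as a function of Lambda (second eq. in (21))
        return np.mean(omega / (1.0 + omega * L))

    def F(L):                                 # residual of the first equation in (21)
        L0 = 1.0 / g(L)
        return np.sum(w / (L0 - L + D_vals)) - g(L)

    L = 1.0 / (lam + 1j * eta)                # crude starting point (assumption)
    for _ in range(newton_steps):
        h = 1e-7 * (1.0 + abs(L))
        dF = (F(L + h) - F(L - h)) / (2 * h)  # numerical derivative for the Newton step
        step = F(L) / dF
        L = L - step
        if abs(step) < tol:
            break
    return L
```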
60 While the analytical results are based on Poisson statistics, the simulations of random resampling were performed by choosing a fixed number of data (equal to the expected number under the Poisson distribution) at random with replacement. [sent-109, score-0.616]
61 The first experiment was for a set of data generated at random from a spherical Gaussian. [sent-110, score-0.061]
62 To show that resampling may be useful, we give on the left-hand side of Figure 1 the reconstruction error as a function of the value of λ below which eigenvalues are discarded. [sent-111, score-0.87]
63 Figure 1: Left: Errors for PCA on N = 32 spherically Gaussian data with d = 25 and µ = 3. [sent-112, score-0.064]
64 Smooth curve: approximate resampled error estimate, upper step function: true error. [sent-113, score-0.302]
65 Right: Comparison of EC approximation (line) and simulation (histogram) of the resampled density of eigenvalues for N = 50 spherically Gaussian data of dimensionality d = 25. [sent-115, score-0.502]
66 The smooth function is the approximate resampling error (3× oversampled so that not many data are left out of the samples) from our method. [sent-117, score-0.672]
67 The upper step function gives the true reconstruction error (easy to calculate for spherical data) from (1). [sent-118, score-0.245]
68 The right panel demonstrates the accuracy of the approximation on a similar set of data. [sent-120, score-0.098]
69 We compare the analytically approximated density of states with the results of a true resampling experiment, where eigenvalues for many samples are counted into small bins. [sent-121, score-0.747]
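The simulation side of this comparison is straightforward to reproduce; one possible sketch (function name and defaults are illustrative) is:

```python
import numpy as np

def resampled_eigenvalue_histogram(Y, mu=1.0, n_resamples=500, bins=50, seed=0):
    """Histogram of the eigenvalues of the resampled covariance matrix, pooled over resamples."""
    rng = np.random.default_rng(seed)
    d, N = Y.shape
    m = int(round(mu * N))                       # number of points drawn with replacement
    eigenvalues = []
    for _ in range(n_resamples):
        idx = rng.integers(0, N, size=m)
        C = Y[:, idx] @ Y[:, idx].T / m          # covariance of the resampled data
        eigenvalues.append(np.linalg.eigvalsh(C))
    return np.histogram(np.concatenate(eigenvalues), bins=bins, density=True)
```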
70 Since the good accuracy might be attributed to the high symmetry of the toy data, we have also performed experiments on a set of N = 100 handwritten digits with d = 784. [sent-123, score-0.083]
71 Although the density of eigenvalues is more accurate than the resampling error, the latter still comes out reasonably well. [sent-125, score-0.747]
72 7 Corrections. I will show next that the EC approximation can be augmented by a perturbation expansion. [sent-126, score-0.106]
73 Going back to (19), we can write $Z^{(n)} = Z_1 \int dx\, p_1(x)\, \psi_2(x)\, e^{\Lambda_1 x^T x} = \int dx\, \psi_2(x)\, e^{\frac{1}{2}\Lambda x^T x} \int \frac{dk}{(2\pi)^{Nn}}\, e^{-i k^T x}\, \chi(k)$. [sent-127, score-0.446]
74 where $\chi(k) = \int dx\, p_1(x)\, e^{i k^T x}$ is the characteristic function of the density $p_1$ (18). [sent-128, score-0.284]
75 Using the symmetries of the density $p_1$, we can perform a power series expansion of $\ln \chi(k)$, which starts with a quadratic term (second cumulant): $\ln \chi(k) = -\frac{M_2}{2}\, k^T k + R(k)$ (22), where $M_2 = \langle x_a^T x_a \rangle_1$. [sent-130, score-0.849]
76 It can be shown that if we neglect R(k) (containing the higher order cumulants) and carry out the integral over k, we end up replacing $p_1$ by a simpler Gaussian $p_0$ with matching moments $M_2$, i.e. we recover the EC approximation. [sent-131, score-0.117]
77 Higher order corrections to the free energy are obtained as $-E_D[\ln Z] = -\ln Z_{EC} + \Delta F_1 + \ldots$ [sent-134, score-0.539]
78 This expansion is similar in spirit to Edgeworth expansions. [sent-141, score-0.06]
79 Figure 2: Left: Resampling error (µ = 1) for PCA on a set of 100 handwritten digits ("5") with d = 784. [sent-143, score-0.148]
80 The approximation (line) for µ = 1 is compared with simulations of the random resampling. [sent-144, score-0.107]
81 Right: Resampled density of eigenvalues for the same data set. [sent-145, score-0.209]
82 The left panel of Figure 3 demonstrates that the true correction is fairly small. [sent-151, score-0.188]
83 The right panel shows that the lowest order term ∆F1 accounts for a major part of the true correction when µ < 3. [sent-152, score-0.226]
84 The strong underestimation for larger µ needs further investigation. [sent-153, score-0.034]
85 8 The calculation without replicas. Knowing with hindsight what the final EC result (20) looks like, we can rederive it using another method which does not rely on the "replica trick". [sent-154, score-0.11]
86 We first write down an exact expression for − ln Z before averaging. [sent-155, score-0.336]
87 Expressing Gaussian integrals by determinants yields $-\ln Z = -\ln \int dx\, e^{-\frac{1}{2} x^T (D + (\Lambda_0-\Lambda)I)\, x} - \ln \int dx\, e^{-\frac{1}{2} x^T (K^{-1}+\Lambda I)\, x} + \ln \int dx\, e^{-\frac{1}{2}\Lambda_0\, x^T x} + \frac{1}{2}\ln \det(I + r)$ (24), where the matrix r has elements $r_{ij} = \left(1 - \frac{\Lambda_0}{\Lambda_0 - \Lambda + D_i}\right)\left[\Lambda_0 \left(K^{-1} + \Lambda I\right)^{-1} - I\right]_{ij}$. [sent-156, score-2.492]
88 The EC approximation is obtained by simply neglecting r. [sent-157, score-0.089]
89 Corrections to this are found by expanding $\ln \det(I + r) = \mathrm{Tr}\,\ln(I + r) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k}\, \mathrm{Tr}\, r^k$ (25). [sent-158, score-1.318]
90 [Figure 3 panel titles: "Correction to resampling error" and "Resampled reconstruction error (λ = 0)".] [sent-159, score-0.184]
91 Figure 3: Left: Resampling error Er from the λ = 0 subspace as a function of the resampling rate for the digits data. [sent-169, score-0.704]
92 The approximation (lower line) is compared with simulations of the random resampling (upper line). [sent-170, score-0.645]
93 Right: The difference between approximation and simulations (upper curve) and its estimate (lower curve) from the perturbative correction (23). [sent-171, score-0.37]
94 The first order term in the expansion (25) vanishes after averaging (see (21)) and the second order term gives exactly the correction of the cumulant method (23). [sent-172, score-0.282]
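The exact relation (24), and the quality of truncating (25) after two terms, can be checked numerically once $\Lambda_0$ and $\Lambda$ are fixed; the sketch below uses an arbitrary feasible choice of the two parameters together with the Poisson/with-replacement setup assumed earlier, so the specific parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, Gamma, mu, eps = 30, 20, 0.7, 3.0, 0.0
Lam0, Lam = 1.0, 0.3                              # arbitrary (feasible) variational parameters

Y = rng.standard_normal((d, N))
s = rng.poisson(mu, size=N)
Dvec = (s + eps * (s == 0)) / (mu * Gamma)        # diagonal of D, as in (3)
K = Y.T @ Y / N
Kinv = np.linalg.inv(K)

A1 = np.diag(Dvec + (Lam0 - Lam))                 # D + (Lambda_0 - Lambda) I
A2 = Kinv + Lam * np.eye(N)                       # K^{-1} + Lambda I

# matrix r as given below (24)
r = np.diag(1.0 - Lam0 / (Dvec + Lam0 - Lam)) @ (Lam0 * np.linalg.inv(A2) - np.eye(N))

exact = 0.5 * np.linalg.slogdet(Kinv + np.diag(Dvec))[1] - 0.5 * N * np.log(2 * np.pi)
ec_terms = (0.5 * np.linalg.slogdet(A1)[1] + 0.5 * np.linalg.slogdet(A2)[1]
            - 0.5 * N * np.log(Lam0) - 0.5 * N * np.log(2 * np.pi))
print(exact, ec_terms + 0.5 * np.linalg.slogdet(np.eye(N) + r)[1])   # (24): the two sides agree

# two-term truncation of (25); a good match only when r is small
print(np.linalg.slogdet(np.eye(N) + r)[1], np.trace(r) - 0.5 * np.trace(r @ r))
```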
95 9 Outlook. It will be interesting to extend the perturbative framework for the computation of corrections to inference approximations to other, more complex models. [sent-173, score-0.275]
96 However, our results indicate that the use and convergence of such perturbation expansions need to be critically investigated, and that the lowest order may not always give a clear indication of the accuracy of the approximation. [sent-174, score-0.177]
97 Rattray Limiting form of the sample covariance matrix eigenspectrum in PCA and kernel PCA. [sent-199, score-0.089]
98 Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001). [sent-203, score-0.028]
wordName wordTfidf (topN-words)
[('resampling', 0.538), ('ln', 0.336), ('dx', 0.223), ('ec', 0.213), ('zec', 0.193), ('resampled', 0.168), ('correction', 0.151), ('pca', 0.149), ('eigenvalues', 0.148), ('tr', 0.121), ('reconstruction', 0.119), ('perturbative', 0.112), ('yj', 0.098), ('ole', 0.096), ('ul', 0.096), ('replica', 0.095), ('corrections', 0.095), ('partition', 0.088), ('pq', 0.084), ('winther', 0.084), ('lim', 0.079), ('resample', 0.076), ('xt', 0.073), ('cumulant', 0.071), ('approximate', 0.069), ('inference', 0.068), ('sj', 0.067), ('ut', 0.067), ('poisson', 0.065), ('error', 0.065), ('malzahn', 0.064), ('spherically', 0.064), ('yyt', 0.064), ('opper', 0.064), ('yk', 0.063), ('approximation', 0.061), ('integrals', 0.061), ('spherical', 0.061), ('density', 0.061), ('expansion', 0.06), ('variational', 0.06), ('southampton', 0.056), ('xa', 0.056), ('free', 0.055), ('matrix', 0.054), ('er', 0.054), ('energy', 0.053), ('expectation', 0.053), ('subspace', 0.052), ('manfred', 0.051), ('replicas', 0.051), ('continuation', 0.051), ('gaussian', 0.05), ('digits', 0.049), ('di', 0.048), ('simulations', 0.046), ('si', 0.046), ('perturbation', 0.045), ('det', 0.043), ('projection', 0.041), ('stationary', 0.04), ('uk', 0.038), ('intractable', 0.038), ('lowest', 0.038), ('eigenvectors', 0.037), ('panel', 0.037), ('trick', 0.036), ('analytic', 0.036), ('expressing', 0.036), ('curve', 0.036), ('contained', 0.036), ('green', 0.036), ('dimensional', 0.036), ('ed', 0.035), ('covariance', 0.035), ('averaged', 0.035), ('moments', 0.034), ('handwritten', 0.034), ('needs', 0.034), ('yi', 0.033), ('analytical', 0.032), ('calculation', 0.031), ('averages', 0.031), ('physics', 0.03), ('carry', 0.029), ('expressed', 0.028), ('integral', 0.028), ('den', 0.028), ('determinants', 0.028), ('minka', 0.028), ('monographs', 0.028), ('monomials', 0.028), ('multiplicities', 0.028), ('neglecting', 0.028), ('optimistically', 0.028), ('rederive', 0.028), ('renormalization', 0.028), ('diagonal', 0.027), ('replacing', 0.026), ('solved', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
Author: Manfred Opper
Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented. 1
2 0.098390438 2 nips-2005-A Bayes Rule for Density Matrices
Author: Manfred K. Warmuth
Abstract: The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculation is a quantum measurement where the co-variance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullbach-Leibler version gives the classical Bayes rule and Umegaki’s quantum relative entropy the new Bayes rule for density matrices. 1
3 0.095188387 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
4 0.087142646 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
Author: Laurent Zwald, Gilles Blanchard
Abstract: This paper presents a non-asymptotic statistical analysis of Kernel-PCA with a focus different from the one proposed in previous work on this topic. Here instead of considering the reconstruction error of KPCA we are interested in approximation error bounds for the eigenspaces themselves. We prove an upper bound depending on the spacing between eigenvalues but not on the dimensionality of the eigenspace. As a consequence this allows to infer stability results for these estimated spaces. 1 Introduction. Principal Component Analysis (PCA for short in the sequel) is a widely used tool for data dimensionality reduction. It consists in finding the most relevant lower-dimension projection of some data in the sense that the projection should keep as much of the variance of the original data as possible. If the target dimensionality of the projected data is fixed in advance, say D – an assumption that we will make throughout the present paper – the solution of this problem is obtained by considering the projection on the span SD of the first D eigenvectors of the covariance matrix. Here by ’first D eigenvectors’ we mean eigenvectors associated to the D largest eigenvalues counted with multiplicity; hereafter with some abuse the span of the first D eigenvectors will be called “D-eigenspace” for short when there is no risk of confusion. The introduction of the ’Kernel trick’ has allowed to extend this methodology to data mapped in a kernel feature space, then called KPCA [8]. The interest of this extension is that, while still linear in feature space, it gives rise to nonlinear interpretation in original space – vectors in the kernel feature space can be interpreted as nonlinear functions on the original space. For PCA as well as KPCA, the true covariance matrix (resp. covariance operator) is not known and has to be estimated from the available data, an procedure which in the case of ¨ Kernel spaces is linked to the so-called Nystrom approximation [13]. The subspace given as an output is then obtained as D-eigenspace SD of the empirical covariance matrix or operator. An interesting question from a statistical or learning theoretical point of view is then, how reliable this estimate is. This question has already been studied [10, 2] from the point of view of the reconstruction error of the estimated subspace. What this means is that (assuming the data is centered in Kernel space for simplicity) the average reconstruction error (square norm of the distance to the projection) of SD converges to the (optimal) reconstruction error of SD and that bounds are known about the rate of convergence. However, this does not tell us much about the convergence of SD to SD – since two very different subspaces can have a very similar reconstruction error, in particular when some eigenvalues are very close to each other (the gap between the eigenvalues will actually appear as a central point of the analysis to come). In the present work, we set to study the behavior of these D-eigenspaces themselves: we provide finite sample bounds describing the closeness of the D-eigenspaces of the empirical covariance operator to the true one. There are several broad motivations for this analysis. First, the reconstruction error alone is a valid criterion only if one really plans to perform dimensionality reduction of the data and stop there. 
However, PCA is often used merely as a preprocessing step and the projected data is then submitted to further processing (which could be classification, regression or something else). In particular for KPCA, the projection subspace in the kernel space can be interpreted as a subspace of functions on the original space; one then expects these functions to be relevant for the data at hand and for some further task (see e.g. [3]). In these cases, if we want to analyze the full procedure (from a learning theoretical sense), it is desirable to have a more precise information on the selected subspace than just its reconstruction error. In particular, from a learning complexity point of view, it is important to ensure that functions used for learning stay in a set of limited complexity, which is ensured if the selected subspace is stable (which is a consequence of its convergence). The approach we use here is based on perturbation bounds and we essentially walk in the steps pioneered by Kolchinskii and Gin´ [7] (see also [4]) using tools of operator perturbae tion theory [5]. Similar methods have been used to prove consistency of spectral clustering [12, 11]. An important difference here is that we want to study directly the convergence of the whole subspace spanned by the first D eigenvectors instead of the separate convergence of the individual eigenvectors; in particular we are interested in how D acts as a complexity parameter. The important point in our main result is that it does not: only the gap between the D-th and the (D + 1)-th eigenvalue comes into account. This means that there in no increase in complexity (as far as this bound is concerned: of course we cannot exclude that better bounds can be obtained in the future) between estimating the D-th eigenvector alone or the span of the first D eigenvectors. Our contribution in the present work is thus • to adapt the operator perturbation result of [7] to D-eigenspaces. • to get non-asymptotic bounds on the approximation error of Kernel-PCA eigenspaces thanks to the previous tool. In section 2 we introduce shortly the notation, explain the main ingredients used and obtain a first bound based on controlling separately the first D eigenvectors, and depending on the dimension D. In section 3 we explain why the first bound is actually suboptimal and derive an improved bound as a consequence of an operator perturbation result that is more adapted to our needs and deals directly with the D-eigenspace as a whole. Section 4 concludes and discusses the obtained results. Mathematical proofs are found in the appendix. 2 First result. Notation. The interest variable X takes its values in some measurable space X , following the distribution P . We consider KPCA and are therefore primarily interested in the mapping of X into a reproducing kernel Hilbert space H with kernel function k through the feature mapping ϕ(x) = k(x, ·). The objective of the kernel PCA procedure is to recover a D-dimensional subspace SD of H such that the projection of ϕ(X) on SD has maximum averaged squared norm. All operators considered in what follows are Hilbert-Schmidt and the norm considered for these operators will be the Hilbert-Schmidt norm unless precised otherwise. Furthermore we only consider symmetric nonnegative operators, so that they can be diagonalized and have a discrete spectrum. Let C denote the covariance operator of variable ϕ(X). To simplify notation we assume that nonzero eigenvalues λ1 > λ2 > . . . of C are all simple (This is for convenience only. 
In the conclusion we discuss what changes have to be made if this is not the case). Let φ1 , φ2 , . . . be the associated eigenvectors. It is well-known that the optimal D-dimensional reconstruction space is SD = span{φ1 , . . . , φD }. The KPCA procedure approximates this objective by considering the empirical covariance operator, denoted Cn , and the subspace SD spanned by its first D eigenvectors. We denote PSD , PSD the orthogonal projectors on b these spaces. A first bound. Broadly speaking, the main steps required to obtain the type of result we are interested in are 1. A non-asympotic bound on the (Hilbert-Schmidt) norm of the difference between the empirical and the true covariance operators; 2. An operator perturbation result bounding the difference between spectral projectors of two operators by the norm of their difference. The combination of these two steps leads to our goal. The first step consists in the following Lemma coming from [9]: Lemma 1 (Corollary 5 of [9]) Supposing that supx∈X k(x, x) ≤ M , with probability greater than 1 − e−ξ , 2M ξ Cn − C ≤ √ 1+ . 2 n As for the second step, [7] provides the following perturbation bound (see also e.g. [12]): Theorem 2 (Simplified Version of [7], Theorem 5.2 ) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple positive eigenvalues λ 1 > 1 λ2 > . . . For an integer r such that λr > 0, let δr = δr ∧ δr−1 where δr = 2 (λr − λr+1 ). Let B ∈ HS(H) be another symmetric operator such that B < δr /2 and (A + B) is still a positive operator with simple nonzero eigenvalues. Let Pr (A) (resp. Pr (A + B)) denote the orthogonal projector onto the subspace spanned by the r-th eigenvector of A (resp. (A + B)). Then, these projectors satisfy: Pr (A) − Pr (A + B) ≤ 2 B δr . Remark about the Approximation Error of the Eigenvectors: let us recall that a control over the Hilbert-Schmidt norm of the projections onto eigenspaces imply a control on the approximation errors of the eigenvectors themselves. Indeed, let φr , ψr denote the (normalized) r-th eigenvectors of the operators above with signs chosen so that φ r , ψr > 0. Then P φ r − P ψr 2 2 = 2(1 − φr , ψr ) ≥ 2(1 − φr , ψr ) = φr − ψr 2 . Now, the orthogonal projector on the direct sum of the first D eigenspaces is the sum D r=1 Pr . Using the triangle inequality, and combining Lemma 1 and Theorem 2, we conclude that with probability at least 1 − e−ξ the following holds: D P SD − P SD ≤ b −1 δr r=1 4M √ n 1+ ξ 2 , 2 provided that n ≥ 16M 2 1 + ξ 2 −2 (sup1≤r≤D δr ) . The disadvantage of this bound is that we are penalized on the one hand by the (inverse) gaps between the eigenvalues, and on the other by the dimension D (because we have to sum the inverse gaps from 1 to D). In the next section we improve the operator perturbation bound to get an improved result where only the gap δD enters into account. 3 Improved Result. We first prove the following variant on the operator perturbation property which better corresponds to our needs by taking directly into account the projection on the first D eigenvectors at once. The proof uses the same kind of techniques as in [7]. Theorem 3 Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space H with simple nonzero eigenvalues λ1 > λ2 > . . . Let D > 0 be an integer such that λD > 0, δD = 1 (λD − λD+1 ). Let B ∈ HS(H) be another symmetric operator such that 2 B < δD /2 and (A + B) is still a positive operator. Let P D (A) (resp. 
P D (A + B)) denote the orthogonal projector onto the subspace spanned by the first D eigenvectors A (resp. (A + B)). Then these satisfy: P D (A) − P D (A + B) ≤ B . δD (1) This then gives rise to our main result on KPCA: Theorem 4 Assume that supx∈X k(x, x) ≤ M . Let SD , SD be the subspaces spanned by the first D eigenvectors of C, resp. Cn defined earlier. Denoting λ1 > λ2 > . . . the 1 eigenvalues of C, if D > 0 is such that λD > 0, put δD = 2 (λD − λD+1 ) and BD = 2M δD 1+ ξ 2 . 2 Then provided that n ≥ BD , the following bound holds with probability at least 1 − e−ξ : BD P SD − P SD ≤ √ . b n (2) This entails in particular ⊥ SD ⊂ g + h, g ∈ SD , h ∈ SD , h 1 Hk ≤ 2BD n− 2 g Hk . (3) The important point here is that the approximation error now only depends on D through the (inverse) gap between the D-th and (D + 1)-th eigenvalues. Note that using the results of section 2, we would have obtained exactly the same bound for estimating the D-th eigenvector only – or even a worse bound since δD = δD ∧ δD−1 appears in this case. Thus, at least from the point of view of this technique (which could still yield suboptimal bounds), there is no increase of complexity between estimating the D-th eigenvector alone and estimating the span of the first D eigenvectors. Note that the inclusion (3) can be interpreted geometrically by saying that for any vector in SD , the √ tangent of the angle between this vector and its projection on SD is upper bounded by BD / n, which we can interpret as a stability property. Comment about the Centered Case. In the actual (K)PCA procedure, the data is actually first empirically recentered, so that one has to consider the centered covariance operator C and its empirical counterpart C n . A result similar to Theorem 4 also holds in this case (up to some additional constant factors). Indeed, a result similar to Lemma 1 holds for the recentered operators [2]. Combined again with Theorem 3, this allows to come to similar conclusions for the “true” centered KPCA. 4 Conclusion and Discussion In this paper, finite sample size confidence bounds of the eigenspaces of Kernel-PCA (the D-eigenspaces of the empirical covariance operator) are provided using tools of operator perturbation theory. This provides a first step towards an in-depth complexity analysis of algorithms using KPCA as pre-processing, and towards taking into account the randomness of the obtained models (e.g. [3]). We proved a bound in which the complexity factor for estimating the eigenspace SD by its empirical counterpart depends only on the inverse gap between the D-th and (D + 1)-th eigenvalues. In addition to the previously cited works, we take into account the centering of the data and obtain comparable rates. In this work we assumed for simplicity of notation the eigenvalues to be simple. In the case the covariance operator C has nonzero eigenvalues with multiplicities m1 , m2 , . . . possibly larger than one, the analysis remains the same except for one point: we have to assume that the dimension D of the subspaces considered is of the form m1 + · · · + mr for a certain r. This could seem restrictive in comparison with the results obtained for estimating the sum of the first D eigenvalues themselves [2] (which is linked to the reconstruction error in KPCA) where no such restriction appears. However, it should be clear that we need this restriction when considering D−eigenspaces themselves since the target space has to be unequivocally defined, otherwise convergence cannot occur. 
Thus, it can happen in this special case that the reconstruction error converges while the projection space itself does not. Finally, a common point of the two analyses (over the spectrum and over the eigenspaces) lies in the fact that the bounds involve an inverse gap in the eigenvalues of the true covariance operator. Finally, how tight are these bounds and do they at least carry some correct qualitative information about the behavior of the eigenspaces? Asymptotic results (central limit Theorems) in [6, 4] always provide the correct goal to shoot for since they actually give the limit distributions of these quantities. They imply that there is still important ground to cover before bridging the gap between asymptotic and non-asymptotic. This of course opens directions for future work. Acknowledgements: This work was supported in part by the PASCAL Network of Excellence (EU # 506778). A Appendix: proofs. Proof of Lemma 1. This lemma is proved in [9]. We give a short proof for the sake of n 1 completness. Cn − C = n i=1 CXi − E [CX ] with CX = ϕ(X) ⊗ ϕ(X)∗ = k(X, X) ≤ M . We can apply the bounded difference inequality to the variable Cn − C , so that with probability greater than 1 − e−ξ , Cn − C ≤ E [ Cn − C ] + 2M Moreover, by Jensen’s inequality E [ Cn − C ] ≤ E n 1 simple calculations leads to E n i=1 CXi − E [CX ] 2 4M n . This concludes the proof of lemma 1. 1 n 2 n i=1 CXi 1 = nE 2 ξ 2n . 1 2 − E [CX ] , and CX − E [CX ] 2 ≤ Proof of Theorem 3. The variation of this proof with respect to Theorem 5.2 in [7] is (a) to work directly in a (infinite-dimensional) Hilbert space, requiring extra caution for some details and (b) obtaining an improved bound by considering D-eigenspaces at once. The key property of Hilbert-Schmidt operators allowing to work directly in a infinite dimensional setting is that HS(H) is a both right and left ideal of Lc (H, H), the Banach space of all continuous linear operators of H endowed with the operator norm . op . Indeed, ∀ T ∈ HS(H), ∀S ∈ Lc (H, H), T S and ST belong to HS(H) with TS ≤ T S ST ≤ T and op S op . (4) The spectrum of an Hilbert-Schmidt operator T is denoted Λ(T ) and the sequence of eigenvalues in non-increasing order is denoted λ(T ) = (λ1 (T ) ≥ λ2 (T ) ≥ . . .) . In the following, P D (T ) denotes the orthogonal projector onto the D-eigenspace of T . The Hoffmann-Wielandt inequality in infinite dimensional setting[1] yields that: λ(A) − λ(A + B) 2 ≤ B ≤ δD . 2 (5) implying in particular that ∀i > 0, |λi (A) − λi (A + B)| ≤ δD . 2 (6) Results found in [5] p.39 yield the formula P D (A) − P D (A + B) = − 1 2iπ γ (RA (z) − RA+B (z))dz ∈ Lc (H, H) . (7) where RA (z) = (A − z Id)−1 is the resolvent of A, provided that γ is a simple closed curve in C enclosing exactly the first D eigenvalues of A and (A + B). Moreover, the same reference (p.60) states that for ξ in the complementary of Λ(A), RA (ξ) op = dist(ξ, Λ(A)) −1 . (8) The proof of the theorem now relies on the simple choice for the closed curve γ in (7), drawn in the picture below and consisting of three straight lines and a semi-circle of radius D L. For all L > δ2 , γ intersect neither the eigenspectrum of A (by equation (6)) nor the eigenspectrum of A + B. Moreover, the eigenvalues of A (resp. A + B) enclosed by γ are exactly λ1 (A), . . . , λD (A) (resp. λ1 (A + B), . . . , λD (A + B)). Moreover, for z ∈ γ, T (z) = RA (z) − RA+B (z) = −RA+B (z)BRA (z) belongs to HS(H) and depends continuously on z by (4). Consequently, P D (A) − P D (A + B) ≤ 1 2π b a (RA − RA+B )(γ(t)) |γ (t)|dt . 
N Let SN = n=0 (−1)n (RA (z)B)n RA (z). RA+B (z) = (Id + RA (z)B)−1 RA (z) and, for z ∈ γ and L > δD , RA (z)B op ≤ RA (z) op B ≤ δD 1 ≤ , 2 dist(z, Λ(A)) 2 γ L L δD λ 0 D+1 δD λ2 λD λ1 δD δD δD 2 2 2 L . op imply that SN −→ RA+B (z) (uniformly for z ∈ γ). Using property (4), since B ∈ . HS(H), SN BRA (z) −→ RA+B (z)BRA (z) = RA+B (z) − RA (z) . Finally, RA (z) − RA+B (z) = (−1)n (RA (z)B)n RA (z) n≥1 where the series converges in HS(H), uniformly in z ∈ γ. Using again property (4) and (8) implies B n (RA − RA+B )(γ(t)) ≤ RA (γ(t)) n+1 B n ≤ op distn+1 (γ(t), Λ(A)) n≥1 Finally, since for L > δD , B ≤ δD 2 n≥1 ≤ dist(γ(t),Λ(A)) , 2 b B 1 |γ (t)|dt . 2 (γ(t), Λ(A)) π a dist Splitting the last integral into four parts according to the definition of the contour γ, we obtain P D (A) − P D (A + B) ≤ 2arctan( δL ) 1 µ1 (A) − (µD (A) − δD ) π D |γ (t)|dt ≤ + +2 , dist2 (γ(t), Λ(A)) δD L L2 a and letting L goes to infinity leads to the result. b Proof of Theorem 4. Lemma 1 and Theorem 3 yield inequality (2). Together with as1 2 sumption n ≥ BD it implies PSD − PSD ≤ 2 . Let f ∈ SD : f = PSD (f ) + PSD (f ) . ⊥ b Lemma 5 below with F = SD and G = SD , and the fact that the operator norm is bounded by the Hilbert-Schmidt norm imply that 4 PSD (f ) 2 k ≤ PSD − PSD 2 PSD (f ) 2 k . ⊥ b H H 3 Gathering the different inequalities, Theorem 4 is proved. Lemma 5 Let F and G be two vector subspaces of H such that PF − PG the following bound holds: 4 PF − PG 2 PF (f ) 2 . ∀ f ∈ G , PF ⊥ (f ) 2 ≤ H op H 3 op 1 ≤ 2 . Then Proof of Lemma 5. = f − PF (f ) 2 = (PG − PF )(f ) 2 = PF − PF ⊥ (f ) 2 For f ∈ G, we have PG (f ) = f , hence PF (f ) 2 op PG 2 op ≤ P F − PG gathering the terms containing PF ⊥ (f ) 1/4 leads to the conclusion. 2 f 2 2 + PF ⊥ (f ) 2 on the left-hand side and using PF −PG 2 op ≤ References [1] R. Bhatia and L. Elsner. The Hoffman-Wielandt inequality in infinite dimensions. Proc.Indian Acad.Sci(Math. Sci.) 104 (3), p. 483-494, 1994. [2] G. Blanchard, O. Bousquet, and L. Zwald. Statistical Properties of Kernel Principal Component Analysis. Proceedings of the 17th. Conference on Learning Theory (COLT 2004), p. 594–608. Springer, 2004. [3] G. Blanchard, P. Massart, R. Vert, and L. Zwald. Kernel projection machine: a new tool for pattern recognition. Proceedings of the 18th. Neural Information Processing System (NIPS 2004), p. 1649–1656. MIT Press, 2004. [4] J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the Principal Component Analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis 12, 136-154, 1982. [5] T. Kato. Perturbation Theory for Linear Operators. New-York: Springer-Verlag, 1966. [6] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191–227, 1998. [7] V. Koltchinskii and E. Gin´ . Random matrix approximation of spectra of integral e operators. Bernoulli, 6(1):113–167, 2000. [8] B. Sch¨ lkopf, A. J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a o u kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [9] J. Shawe-Taylor and N. Cristianini. Estimating the moments of a random vector with applications. Proceedings of the GRETSI 2003 Conference, p. 47-52, 2003. [10] J. Shawe-Taylor, C. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of Kernel PCA. IEEE Transactions on Information Theory 51 (7), p. 2510-2522, 2005. [11] U. von Luxburg, M. 
Belkin, and O. Bousquet. Consistency of spectral clustering. Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004. [12] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004), p. 457–471. Springer, 2004. [13] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (ICML), p. 1159–1166. Morgan Kaufmann, 2000.
5 0.082402259 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
Author: Firas Hamze, Nando de Freitas
Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.
6 0.079492353 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
7 0.078920417 189 nips-2005-Tensor Subspace Analysis
8 0.077761628 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
9 0.074107498 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
10 0.073266953 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
11 0.070483498 80 nips-2005-Gaussian Process Dynamical Models
12 0.068946213 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
13 0.068258561 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
14 0.064999118 30 nips-2005-Assessing Approximations for Gaussian Process Classification
15 0.064651139 45 nips-2005-Conditional Visual Tracking in Kernel Space
16 0.063206621 125 nips-2005-Message passing for task redistribution on sparse graphs
17 0.063147381 102 nips-2005-Kernelized Infomax Clustering
18 0.060906973 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
19 0.060126476 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
20 0.059627343 126 nips-2005-Metric Learning by Collapsing Classes
topicId topicWeight
[(0, 0.197), (1, 0.067), (2, -0.026), (3, -0.024), (4, 0.029), (5, -0.078), (6, -0.057), (7, -0.058), (8, 0.041), (9, -0.04), (10, 0.074), (11, 0.029), (12, -0.018), (13, -0.001), (14, -0.042), (15, 0.059), (16, -0.072), (17, -0.072), (18, -0.023), (19, -0.101), (20, 0.114), (21, -0.005), (22, 0.051), (23, 0.185), (24, 0.149), (25, 0.073), (26, 0.098), (27, -0.038), (28, 0.078), (29, 0.053), (30, 0.051), (31, -0.03), (32, 0.023), (33, -0.06), (34, 0.109), (35, 0.059), (36, 0.081), (37, -0.112), (38, 0.003), (39, -0.141), (40, 0.062), (41, -0.073), (42, 0.004), (43, -0.076), (44, -0.037), (45, 0.087), (46, 0.001), (47, -0.187), (48, 0.007), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.94833016 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
Author: Manfred Opper
Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented. 1
2 0.53975737 189 nips-2005-Tensor Subspace Analysis
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: Previous work has demonstrated that the image variations of many objects (human faces in particular) under variable lighting can be effectively modeled by low dimensional linear spaces. The typical linear subspace learning algorithms include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Locality Preserving Projection (LPP). All of these methods consider an n1 × n2 image as a high dimensional vector in Rn1 ×n2 , while an image represented in the plane is intrinsically a matrix. In this paper, we propose a new algorithm called Tensor Subspace Analysis (TSA). TSA considers an image as the second order tensor in Rn1 ⊗ Rn2 , where Rn1 and Rn2 are two vector spaces. The relationship between the column vectors of the image matrix and that between the row vectors can be naturally characterized by TSA. TSA detects the intrinsic local geometrical structure of the tensor space by learning a lower dimensional tensor subspace. We compare our proposed approach with PCA, LDA and LPP methods on two standard databases. Experimental results demonstrate that TSA achieves better recognition rate, while being much more efficient. 1
3 0.53573513 2 nips-2005-A Bayes Rule for Density Matrices
Author: Manfred K. Warmuth
Abstract: The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variances along any direction is determined by the covariance matrix. Curiously enough this expected variance calculation is a quantum measurement where the co-variance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullbach-Leibler version gives the classical Bayes rule and Umegaki’s quantum relative entropy the new Bayes rule for density matrices. 1
4 0.51363754 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes
Author: Yunsong Huang, B. Keith Jenkins
Abstract: We develop an approach for estimation with Gaussian Markov processes that imposes a smoothness prior while allowing for discontinuities. Instead of propagating information laterally between neighboring nodes in a graph, we study the posterior distribution of the hidden nodes as a whole—how it is perturbed by invoking discontinuities, or weakening the edges, in the graph. We show that the resulting computation amounts to feed-forward fan-in operations reminiscent of V1 neurons. Moreover, using suitable matrix preconditioners, the incurred matrix inverse and determinant can be approximated, without iteration, in the same computational style. Simulation results illustrate the merits of this approach.
5 0.50103128 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
Author: Baback Moghaddam, Yair Weiss, Shai Avidan
Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials. 1
6 0.45408982 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis
7 0.43296331 133 nips-2005-Nested sampling for Potts models
8 0.42332104 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
9 0.41360494 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
10 0.40688649 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels
11 0.39464423 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
12 0.39373302 192 nips-2005-The Information-Form Data Association Filter
13 0.36709943 71 nips-2005-Fast Krylov Methods for N-Body Learning
14 0.36363339 62 nips-2005-Efficient Estimation of OOMs
15 0.35977849 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
16 0.35741431 105 nips-2005-Large-Scale Multiclass Transduction
17 0.34631875 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
18 0.34312594 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
19 0.3367289 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
20 0.33648053 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
topicId topicWeight
[(3, 0.05), (8, 0.226), (10, 0.071), (18, 0.039), (27, 0.031), (31, 0.05), (34, 0.127), (50, 0.013), (55, 0.027), (69, 0.034), (73, 0.052), (88, 0.109), (91, 0.075)]
simIndex simValue paperId paperTitle
1 0.85329545 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using samples from the sampling distribution. We provide guarantees for the first two approaches and evaluate the performance of all three approaches in synthetic experiments and on real data from species habitat modeling, where maxent has been successfully applied and where sample selection bias is a significant problem. 1
same-paper 2 0.84585345 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
Author: Manfred Opper
Abstract: The problem of computing a resample estimate for the reconstruction error in PCA is reformulated as an inference problem with the help of the replica method. Using the expectation consistent (EC) approximation, the intractable inference problem can be solved efficiently using only two variational parameters. A perturbative correction to the result is computed and an alternative simplified derivation is also presented. 1
3 0.65972245 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
4 0.65732127 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
5 0.65325689 23 nips-2005-An Application of Markov Random Fields to Range Sensing
Author: James Diebel, Sebastian Thrun
Abstract: This paper describes a highly successful application of MRFs to the problem of generating high-resolution range images. A new generation of range sensors combines the capture of low-resolution range images with the acquisition of registered high-resolution camera images. The MRF in this paper exploits the fact that discontinuities in range and coloring tend to co-align. This enables it to generate high-resolution, low-noise range images by integrating regular camera images into the range data. We show that by using such an MRF, we can substantially improve over existing range imaging technology. 1
6 0.65221411 105 nips-2005-Large-Scale Multiclass Transduction
7 0.65083176 184 nips-2005-Structured Prediction via the Extragradient Method
8 0.65007454 144 nips-2005-Off-policy Learning with Options and Recognizers
9 0.64868897 50 nips-2005-Convex Neural Networks
10 0.64668149 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
11 0.64628816 78 nips-2005-From Weighted Classification to Policy Search
12 0.64517701 177 nips-2005-Size Regularized Cut for Data Clustering
13 0.64496893 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
14 0.64273304 30 nips-2005-Assessing Approximations for Gaussian Process Classification
15 0.64256316 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
16 0.64196277 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
17 0.63986307 133 nips-2005-Nested sampling for Potts models
18 0.63935357 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
19 0.63901716 98 nips-2005-Infinite latent feature models and the Indian buffet process
20 0.63836604 114 nips-2005-Learning Rankings via Convex Hull Separation