nips nips2003 nips2003-115 knowledge-graph by maker-knowledge-mining

115 nips-2003-Linear Dependent Dimensionality Reduction


Source: pdf

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) non-Gaussian additive noise, and to unbiased non-additive models. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We generalize the problem beyond additive Gaussian noise to (unknown) non-Gaussian additive noise, and to unbiased non-additive models. [sent-5, score-0.564]

2 We then study the standard (PCA) approach, show that it is appropriate for additive i.i.d. [sent-15, score-0.132]

3 noise (Section 3), and present a generic estimator that is appropriate also for unbiased non-additive models (Section 4). [sent-18, score-0.879]

4 In Section 5 we confront the non-Gaussianity directly, develop maximum-likelihood estimators in the presence of Gaussian mixture additive noise, and show that the consistency of such maximum-likelihood estimators should not be taken for granted. [sent-19, score-0.642]

5 2 Dependent Dimensionality Reduction. Our starting point is the problem of identifying linear dependencies in the presence of independent, identically distributed Gaussian noise. [sent-20, score-0.171]

6 In this formulation, we observe a data matrix Y ∈ ℝn×d which we assume was generated as Y = X + Z, where the dependent, low-dimensional component X ∈ ℝn×d (the “signal”) is a matrix of rank k and the independent component Z (the “noise”) is i.i.d. [sent-21, score-0.204]

7 We can write down the log-likelihood of X as −‖Y − X‖²Fro/(2σ²) + Const (where ‖·‖Fro is the Frobenius, or sum-squared, norm) and conclude that, regardless of the variance σ², the maximum-likelihood estimator of X is the rank-k matrix minimizing the Frobenius distance. [sent-25, score-0.801]
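A minimal sketch of this fact (hypothetical sizes n, d, k and noise level sigma, not taken from the paper): under i.i.d. Gaussian noise the maximum-likelihood rank-k estimate of X is the truncated SVD of Y, independent of the value of σ².

```python
import numpy as np

# Hypothetical sizes and noise level; the paper fixes none of these.
n, d, k, sigma = 500, 10, 2, 0.3
rng = np.random.default_rng(0)

# Generate Y = X + Z with a rank-k signal X and i.i.d. Gaussian noise Z.
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, d))
X = U @ V
Y = X + sigma * rng.standard_normal((n, d))

# The ML (Frobenius-optimal) rank-k estimate of X is the truncated SVD of Y,
# regardless of sigma.
Uy, s, Vty = np.linalg.svd(Y, full_matrices=False)
X_hat = (Uy[:, :k] * s[:k]) @ Vty[:k, :]
print("relative reconstruction error:", np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```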

8 The dependencies of each row y of Y are captured by a row u of U , which, through the parameters V and σ, specifies how each entry yi is generated independently given u. [sent-32, score-0.223]

9 A standard parametric analysis of the model would view u as a random vector (rather than parameters) and impose some, possibly parametric, distribution over it (interestingly, if u is Gaussian, the maximum-likelihood reconstruction is the same Frobenius low-rank approximation [4]). [sent-36, score-0.121]

10 The model class is then non-parametric, yet we still desire, and are able, to estimate a parametric aspect of the model: the estimator can be seen as an ML estimator for the signal subspace, where the distribution over u is an unconstrained nuisance. [sent-38, score-1.441]

11 Although we did not impose any form on the distribution of u, we did impose a strict form on the conditional distributions yi |u: we required them to be Gaussian with fixed variance σ² and mean uVi . [sent-39, score-0.426]

12 Since u is continuous, we cannot expect to forgo all restrictions on yi |ui , but we can expect to set up a semi-parametric problem in which y|u may lie in an infinite-dimensional family of distributions, and is not strictly parameterized. [sent-43, score-0.2]

13 Relaxing the Gaussianity leads to linear additive models y = uV + z, with z independent of u, but not necessarily Gaussian. [sent-44, score-0.165]

14 E.g., when the noise has a multiplicative component, or when the features of y are not real numbers. [sent-47, score-0.202]

15 These types of models, with a known distribution yi |xi , have been suggested for classification using logistic loss [5], when yi |xi forms an exponential family [6], and in a more abstract framework [7]. [sent-48, score-0.35]

16 Fitting a non-linear manifold by minimizing the sum-squared distance can be seen as an ML estimator for y|u = g(u) + z, where z is i.i.d. [sent-50, score-0.619]

17 Combining these ideas leads us to discuss the conditional distributions yi |gi (u), or yi |u directly. [sent-54, score-0.401]

18 We continue to assume a linear model x = uV and limit ourselves to additive noise models and unbiased models in which E [y|x] = x. [sent-59, score-0.392]

19 We study the estimation of the rank-k signal space in which x resides, based on a sample of n independent observations of y (forming the rows of Y), where the distribution on u is an unconstrained nuisance. [sent-60, score-0.284]

20 In order to study estimators for a subspace, we must be able to compare two subspaces. [sent-61, score-0.115]

21 Define the angle between a vector v1 and a subspace V2 to be the minimal angle between v1 and any v2 ∈ V2 . [sent-63, score-0.264]

22 The largest canonical angle between two subspaces is then the maximal angle between a vector v1 ∈ V1 and the subspace V2 . [sent-64, score-0.363]

23 It is convenient to think of a subspace in terms of the matrix whose columns span it. [sent-66, score-0.236]

24 Computationally, if the columns of V1 and V2 form orthonormal bases of V1 and V2 , then the cosines of the canonical angles between V1 and V2 are given by the singular values of V1⊤V2 . [sent-67, score-0.149]
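A small sketch of this computation (the function name and sizes are ours; the recipe is the one just described: orthonormalize, then take singular values of V1⊤V2):

```python
import numpy as np

def canonical_sines(A, B):
    """Sines of the canonical angles between the column spaces of A and B.

    A, B are d x k matrices whose columns span the two subspaces; the cosines
    of the canonical angles are the singular values of Q1.T @ Q2 for
    orthonormal bases Q1, Q2.
    """
    Q1, _ = np.linalg.qr(A)
    Q2, _ = np.linalg.qr(B)
    cosines = np.clip(np.linalg.svd(Q1.T @ Q2, compute_uv=False), 0.0, 1.0)
    return np.sqrt(1.0 - cosines**2)

# Example: two random rank-2 subspaces of R^10 (arbitrary sizes).
rng = np.random.default_rng(1)
print(canonical_sines(rng.standard_normal((10, 2)), rng.standard_normal((10, 2))))
```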

25 In particular, we will denote by V0 the true signal subspace, i.e. [sent-69, score-0.132]

26 But the L2 estimator is appropriate also in a more general setting. [sent-77, score-0.619]

27 We will show that the L2 estimator is consistent for any i.i.d. [sent-78, score-0.657]

28 additive noise with finite variance (as we will see later on, this is more than can be said for some ML estimators). [sent-81, score-0.412]

29 The eigenvalues of ΛY = ΛX + σ²I are λ1 + σ², . . . , λk + σ², σ², . . . , σ², with the leading k eigenvectors being the eigenvectors of ΛX . [sent-93, score-0.163]

30 This ensures an eigenvalue gap of sk > 0 between the invariant subspace of ΛY spanned by the eigenvectors of ΛX and its complement, and we can bound the norm of the canonical sines between V0 and the leading k eigenvectors of Λ̂n by ‖Λ̂n − ΛY‖ / sk [8]. [sent-94, score-0.608]
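A sketch of the L2 estimator and this gap-based bound on a simulated instance (the sizes, the unit signal covariance, and the use of the spectral norm are our assumptions; the bound is quoted in the sin-theta style of [8]):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, sigma = 2000, 10, 2, 0.5                  # hypothetical sizes

V0 = np.linalg.qr(rng.standard_normal((d, k)))[0]  # true signal subspace basis
Y = rng.standard_normal((n, k)) @ V0.T + sigma * rng.standard_normal((n, d))

# L2 estimator: leading-k eigenvectors of the empirical second-moment matrix.
Lam_n = Y.T @ Y / n
V_hat = np.linalg.eigh(Lam_n)[1][:, -k:]           # eigh sorts eigenvalues ascending

# Canonical sines between the estimate and the true subspace.
cos = np.linalg.svd(V_hat.T @ V0, compute_uv=False)
sines = np.sqrt(np.clip(1 - cos**2, 0, None))

# Eigenvalue-gap bound: ||Lam_n - Lam_Y|| / s_k with Lam_Y = Lam_X + sigma^2 I.
Lam_X = V0 @ V0.T                                  # E[u u^T] = I in this simulation
Lam_Y = Lam_X + sigma**2 * np.eye(d)
s_k = np.linalg.eigvalsh(Lam_X)[-k]                # eigenvalue gap of Lam_Y
print("max sine:", sines.max(), " gap bound:", np.linalg.norm(Lam_n - Lam_Y, 2) / s_k)
```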

31 4 The Variance-Ignoring Estimator. We turn to additive noise with independent, but not identically distributed, coordinates. [sent-98, score-0.372]

32 If the noise variances are known, the ML estimator corresponds to minimizing the column-weighted (inversely proportional to the variances) Frobenius norm of Y − X , and can be calculated from the leading eigenvectors of a scaled empirical covariance matrix [9]. [sent-99, score-1.198]
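One standard way to realize this known-variance estimator is to whiten each coordinate by its noise scale, take the leading eigenvectors of the whitened covariance, and map the basis back; a hedged sketch (the exact recipe of [9] may differ, and all names are ours):

```python
import numpy as np

def ml_subspace_known_variances(Y, k, noise_std):
    """Signal-subspace estimate when the per-coordinate noise scales are known.

    Whiten each column of Y by its noise standard deviation, take the leading-k
    eigenvectors of the whitened empirical covariance, then rescale the basis
    back to the original coordinates.
    """
    Yw = Y / noise_std                              # column-wise whitening
    Lam = Yw.T @ Yw / Y.shape[0]
    evecs = np.linalg.eigh(Lam)[1][:, -k:]          # leading-k eigenvectors
    basis = evecs * noise_std[:, None]              # undo whitening of the directions
    return np.linalg.qr(basis)[0]                   # orthonormal basis of the estimate

# Usage with arbitrary per-coordinate noise scales (placeholders):
# V_hat = ml_subspace_known_variances(Y, k=2, noise_std=np.array([...]))
```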

33 When the scale of different coordinates is not known, there is no ML estimator: at least k coordinates of each y can always be exactly matched, and so the likelihood is unbounded when up to k variances approach zero. [sent-102, score-0.19]

34 We call this an L2 estimator not because it minimizes the matrix L2-norm ‖Y − X‖2, which it does, but because it minimizes the vector L2-norms ‖y − x‖2. [sent-103, score-0.679]

35 We should also be careful about signals that occupy only a proper subspace of V0 , and be satisfied with any rank-k subspace containing the support of x, but for simplicity of presentation we assume this does not happen and x is of full rank k. [sent-104, score-0.403]

36 Figure 1 (legends: full L2, variance-ignored; axes: spread of noise scale (max/min ratio), sample size (number of observed rows)): Norm of sines of canonical angles to correct subspace: (a) Random rank-2 subspaces in ℝ10. [sent-128, score-0.418]

37 Gaussian noise of different scales in different coordinates— between 0. [sent-129, score-0.202]

38 (b) Random rank-2 subspaces in ℝ10, 500 sample rows, and Gaussian noise with varying distortion (mean over 200 simulations, bars are one standard deviation tall). (c) Observations are exponentially distributed with means in the rank-2 subspace spanned by (1 1 1 1 1 1 1 1 1 1) and (1 0 1 0 1 0 1 0 1 0). [sent-132, score-0.449]

39 The L2 estimator is not satisfactory in this scenario. [sent-133, score-0.648]

40 The covariance matrix ΛZ is still diagonal, but is no longer a scaled identity. [sent-134, score-0.121]

41 The additional variance introduced by the noise is different in different directions, and these differences may overwhelm the “signal” variance along V0 , biasing the leading eigenvectors of ΛY , and thus the limit of the L2 estimator, toward axes with high “noise” variance. [sent-135, score-0.46]

42 The fact that this variability is independent of the variability in other coordinates is ignored, and the L2 estimator is asymptotically biased. [sent-136, score-0.745]

43 The row-space of the resulting Λ̂X is then an estimator for the signal subspace. [sent-145, score-0.751]

44 Note that the L2 estimator is the row-space of the rank-k matrix minimizing the unweighted sum-squared distance to Λ̂n. [sent-146, score-0.679]
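A sketch of the variance-ignoring estimator in this spirit: fit a rank-k matrix to the empirical covariance while treating its diagonal as missing. The alternating scheme below (repeatedly imputing the diagonal from the current rank-k fit) is our own simple stand-in for that optimization, not necessarily the paper's exact procedure.

```python
import numpy as np

def variance_ignoring_subspace(Y, k, n_iter=100):
    """Estimate the signal subspace by fitting a rank-k matrix to the
    off-diagonal entries of the empirical covariance.

    The diagonal is treated as missing: at each step it is replaced by the
    diagonal of the current rank-k approximation (a simple imputation scheme,
    one of several ways to optimize this criterion).
    """
    Lam = Y.T @ Y / Y.shape[0]
    M = Lam.copy()
    for _ in range(n_iter):
        evals, evecs = np.linalg.eigh(M)
        Lam_X = (evecs[:, -k:] * evals[-k:]) @ evecs[:, -k:].T   # rank-k fit
        M = Lam.copy()
        np.fill_diagonal(M, np.diag(Lam_X))                      # impute the diagonal
    return np.linalg.eigh(M)[1][:, -k:]                          # leading-k eigenvectors
```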

45 Figures 1(a,b) demonstrate this variance-ignoring estimator on simulated data with non-identical Gaussian noise. [sent-147, score-0.619]

46 The estimator reconstructs the signal-space almost as well as the ML estimator, even though it does not have access to the true noise variance. [sent-148, score-0.821]

47 Discussing consistency in the presence of non-identical noise with unknown variances is problematic, since the signal subspace is not necessarily identifiable. [sent-149, score-0.693]

48 For example, the combined covariance matrix ΛY = (2 1; 1 2) can arise from a rank-one signal covariance ΛX = (a 1; 1 1/a) for any 1 ≤ a ≤ 2, each corresponding to a different signal subspace. [sent-150, score-0.446]
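A small numerical illustration of this non-identifiability (the family of decompositions is exactly the one stated above; the particular values of a are arbitrary):

```python
import numpy as np

Lam_Y = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
for a in (1.0, 1.5, 2.0):
    Lam_X = np.array([[a, 1.0],
                      [1.0, 1.0 / a]])             # rank one: det = a*(1/a) - 1 = 0
    Lam_Z = Lam_Y - Lam_X                          # remaining noise covariance
    direction = np.array([a, 1.0]) / np.hypot(a, 1.0)
    print(a,
          np.allclose(np.diag(np.diag(Lam_Z)), Lam_Z),  # noise stays diagonal
          (np.diag(Lam_Z) >= 0).all(),                  # and positive semi-definite
          direction)                                    # a different subspace each time
```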

49 Until now we have considered only additive noise, in which the distribution of yi − xi was independent of xi . [sent-154, score-0.524]

50 We will now relax this restriction and allow more general conditional distributions yi |xi , requiring only that E [yi |xi ] = xi . [sent-155, score-0.364]

51 With this requirement, together with the structural constraint (yi independent given x), for any i ≠ j: Cov [yi , yj ] = E [yi yj ] − E [yi ]E [yj ] = E [E [yi yj |x]] − E [E [yi |x]]E [E [yj |x]] = E [E [yi |x]E [yj |x]] − E [xi ]E [xj ] = E [xi xj ] − E [xi ]E [xj ] = Cov [xi , xj ]. [sent-156, score-0.204]
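A quick Monte Carlo check of this identity, using an exponential conditional with mean xi as in Figure 1(c) (sizes and the positive-signal construction are ours): the off-diagonal covariances of Y match those of X up to sampling error, while the diagonals do not.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200_000, 4, 2

# Rank-k signal with positive entries, so an exponential with mean x_i is well defined.
X = rng.uniform(0.5, 1.5, size=(n, k)) @ rng.uniform(0.5, 1.5, size=(k, d))
Y = rng.exponential(scale=X)                    # E[y_i | x_i] = x_i, non-additive noise

C_Y = np.cov(Y, rowvar=False)
C_X = np.cov(X, rowvar=False)
off = ~np.eye(d, dtype=bool)
print("max off-diagonal gap:", np.abs(C_Y[off] - C_X[off]).max())
print("max diagonal gap:    ", np.abs(np.diag(C_Y) - np.diag(C_X)).max())
```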

52 As in the non-identical additive noise case, ΛY agrees with ΛX except on the diagonal. [sent-157, score-0.334]

53 Even if yi |xi is identically conditionally distributed for all i, the difference ΛY − ΛX is not in general a scaled identity: Var [yi ] = E[E[yi²|xi ] − E[yi |xi ]²] + E[E[yi |xi ]²] − E[yi ]² = E [Var [yi |xi ]] + Var [xi ]. [sent-158, score-0.386]

54 Unlike the additive noise case, the variance of yi |xi depends on xi , and so its expectation depends on the distribution of xi . [sent-159, score-0.808]

55 Figure 1(c) demonstrates how such an estimator succeeds in reconstruction when yi |xi is exponentially distributed with mean xi , even though the standard L2 estimator is not applicable. [sent-161, score-1.595]

56 We cannot guarantee consistency because the decomposition of the covariance matrix might not be unique, but when k < d this is not likely to happen. [sent-162, score-0.231]

57 To do so, we introduce latent variables Cij specifying the mixture component of the noise at Yij , and solve the problem using EM. [sent-165, score-0.294]

58 Equipped with a WLRA optimization method [5], we can now perform EM iterations in order to find the matrix X maximizing the likelihood of the observed matrix Y . [sent-168, score-0.12]
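A hedged sketch of this EM scheme for additive Gaussian-mixture noise with known mixing weights p and scales sigma (the nuisance-noise variant would also re-estimate them). The alternating weighted-least-squares routine below is a simple stand-in for the WLRA method of [5]; all names and defaults are ours.

```python
import numpy as np

def wlra(Y, W, k, n_sweeps=20, seed=0):
    """Weighted low-rank approximation: rank-k X = U @ V minimizing
    sum_ij W_ij (Y_ij - X_ij)^2, by alternating weighted least squares."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    U, V = rng.standard_normal((n, k)), rng.standard_normal((k, d))
    eye = 1e-9 * np.eye(k)                           # tiny ridge for numerical stability
    for _ in range(n_sweeps):
        for i in range(n):                           # update rows of U
            U[i] = np.linalg.solve((V * W[i]) @ V.T + eye, (V * W[i]) @ Y[i])
        for j in range(d):                           # update columns of V
            V[:, j] = np.linalg.solve((U.T * W[:, j]) @ U + eye,
                                      (U.T * W[:, j]) @ Y[:, j])
    return U @ V

def em_gaussian_mixture_lowrank(Y, k, p, sigma, n_em=20):
    """EM for a rank-k X under additive Gaussian-mixture noise (known p, sigma)."""
    p, var = np.asarray(p, float), np.asarray(sigma, float) ** 2
    X = wlra(Y, np.ones_like(Y), k)                  # initialize from a plain L2 fit
    for _ in range(n_em):
        R = Y - X
        # E-step: per-entry responsibilities of each mixture component.
        ll = -0.5 * R[..., None] ** 2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(p)
        ll -= ll.max(axis=-1, keepdims=True)
        resp = np.exp(ll)
        resp /= resp.sum(axis=-1, keepdims=True)
        # M-step for X: WLRA with per-entry weights sum_c resp_c / sigma_c^2.
        X = wlra(Y, (resp / var).sum(axis=-1), k)
    return X
```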

59 a mixture of Gaussians with zero mean, and variance bounded from below. [sent-173, score-0.17]

60 We investigated two noise distributions: a ’Gaussian with outliers’ distribution formed as a mixture of two zero-mean Gaussians with widely varying variances; and a Laplace distribution p(z) ∝ exp(−|z|), which is an infinite scale mixture of Gaussians. [sent-175, score-0.458]

61 Figures 2(a,b) show the quality of reconstruction of the L2 estimator and the ML bounded GSM estimator, for these two noise distributions, for a fixed sample size of 300 rows. [sent-176, score-0.872]

62 Figure 2 legend: L2; ML, known noise model; ML, nuisance noise model. [sent-179, score-0.436]

63 Figure 2 (x-axis: sample size, number of observed rows): Norm of sines of canonical angles to correct subspace: (a) Random rank-3 subspace in ℝ10 with Laplace noise. [sent-225, score-0.355]

64 We allowed ten Gaussian components, and did not observe any significant change in the estimator when the number of components increases. [sent-238, score-0.619]

65 The ML estimator is overall more accurate than the L2 estimator—it succeeds in reliably reconstructing the low-rank signal for signals which are approximately three times weaker than those necessary for reliable reconstruction using the L2 estimator. [sent-239, score-0.834]

66 Comparison with Newton’s Methods. Confronted with a general additive noise distribution, the approach presented here would be to rewrite, or approximate, it as a Gaussian mixture and use WLRA in order to learn X using EM. [sent-241, score-0.426]

67 But beyond these issues, which can be overcome, lies the major problem of Newton’s method: the noise density must be strictly log-concave and differentiable. [sent-245, score-0.242]

68 On the other hand, approximating the distribution as a Gaussian mixture and using the EM method might still get stuck in local minima, but at least guarantees local improvement. [sent-249, score-0.159]

69 Consistency. Despite the gains in reconstruction presented above, the ML estimator may suffer from an asymptotic bias, making it inferior to the L2 estimator on large samples. [sent-253, score-1.332]

70 The ML estimator is the minimizer of the empirical mean of the random function Φ(V ) = minu (− log p(y − uV )). [sent-256, score-0.706]

71 For the ML estimator to be consistent, E [Φ(V )] must be minimized by V0 , establishing a necessary condition for consistency. [sent-258, score-0.67]

72 It should be noted that the issue here is whether the ML estimator converges at all, since if it does converge, it must converge to the minimizer of E [Φ(V )]. [sent-260, score-0.662]

73 Such convergence can be demonstrated at least in the special case when the marginal noise density p(zi ) is continuous, strictly positive, and has finite variance and differential entropy. [sent-261, score-0.28]

74 Under these conditions, the ML estimator is consistent if and only if V0 is the unique minimizer of E [Φ(V )]. [sent-262, score-0.7]

75 When discussing E [Φ(V )], the expectation is with respect to the noise distribution and the signal distribution. [sent-263, score-0.439]

76 This is not quite satisfactory, as we would like results which are independent of the signal distribution, beyond the rank of its support. [sent-264, score-0.256]

77 For any x ∈ ℝd , consider Ψ(V ; x ) = Ez [φ(x + z; V )], where the expectation is only over the additive noise z. [sent-267, score-0.371]

78 Under the previous conditions guaranteeing the ML estimator converges, it is consistent for any signal distribution if and only if, for all x ∈ ℝd , Ψ(V ; x ) is minimized with respect to V exactly when x ∈ span V . [sent-268, score-0.876]

79 It will be instructive to first revisit the ML estimator in the presence of i.i.d. Gaussian noise, i.e. [sent-269, score-0.646]

80 the L2 estimator, which we already showed is consistent. [sent-274, score-0.619]

81 We will consider the decomposition y = y∥ + y⊥ of vectors into their projection y∥ onto the subspace V , and the residual y⊥. [sent-275, score-0.213]

82 We thus re-derived the consistency of the L2 estimator directly, for the special case in which the noise is indeed Gaussian. [sent-280, score-0.894]

83 This consistency proof employed a key property of the isotropic Gaussian: rotations of an isotropic Gaussian random variable remain i.i.d. [sent-281, score-0.137]

84 As this property is unique to Gaussian random variables, other ML estimators might not be consistent. [sent-284, score-0.115]

85 In fact, we will shortly see that the ML estimator for a known Laplace noise model is not consistent. [sent-285, score-0.821]

86 For any V = (1, α), 0 ≤ α ≤ 1, and (z1 , z2 ), the L1 norm |z + uV |1 is minimized when z1 + u = 0, yielding φ(V ; z ) = |z2 − αz1 |, ignoring a constant term, and Ψ(V ; 0) = ∫∫ (1/4) e^(−|z1|−|z2|) |z2 − αz1 | dz1 dz2 = (α² + α + 1)/(α + 1), which is monotonically increasing in α in the valid range [0, 1]. [sent-291, score-0.122]

87 In particular, 1 = Ψ((1, 0); 0) < Ψ((1, 1); 0) = 3/2, and the estimator is biased towards being axis-aligned. [sent-292, score-0.619]
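A quick Monte Carlo check of this calculation (the closed form (α² + α + 1)/(α + 1) is the expression reconstructed above; the simulation itself is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
z1 = rng.laplace(size=1_000_000)
z2 = rng.laplace(size=1_000_000)

for alpha in (0.0, 0.5, 1.0):
    psi_mc = np.abs(z2 - alpha * z1).mean()           # Psi((1, alpha); 0) by simulation
    psi_cf = (alpha**2 + alpha + 1) / (alpha + 1)     # closed form from the text
    print(f"alpha={alpha:.1f}  monte carlo={psi_mc:.3f}  closed form={psi_cf:.3f}")
# Psi grows with alpha, so the Laplace ML fit is pulled toward axis-aligned subspaces.
```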

88 Two-component Gaussian mixture noise was added to rank-one signal in 3 , and the signal subspace was estimated using an ML estimator with known noise model, and an L2 estimator. [sent-294, score-1.555]

89 For small data sets, the ML estimator is more accurate, but as the number of samples increases, the error of the L2 estimator vanishes, while the ML estimator converges to the wrong subspace. [sent-295, score-1.857]

90 We formulate the problem of dimensionality reduction as semi-parametric estimation of the low-dimensional signal, or “factor” space, treating the signal distribution as an unconstrained nuisance and the noise distribution as a constrained nuisance. [sent-297, score-0.473]

91 We present an estimator which is appropriate when the conditional means E [y|u] lie in a low-dimensional linear space, and a maximum-likelihood estimator for additive Gaussian mixture noise. [sent-298, score-1.507]

92 The variance-ignoring estimator is also applicable when y can be transformed such that E [g(y)|u] lie in a low-rank linear space, e.g. [sent-299, score-0.619]

93 If the conditional distribution y|x is known, this amounts to an unbiased estimator for xi . [sent-302, score-0.841]

94 We draw attention to the fact that maximum-likelihood low-rank estimation cannot be taken for granted, and demonstrate that it might not be consistent even for known noise models. [sent-304, score-0.24]

95 The approach employed here can also be used to investigate the consistency of ML estimators with non-additive noise models. [sent-305, score-0.39]

96 Of particular interest are distributions yi |xi that form exponential families where xi are the natural parameters [6]. [sent-306, score-0.282]

97 When the mean parameters form a low-rank linear subspace, the variance-ignoring estimator is applicable, but when the natural parameters form a linear subspace, the means are in general curved, and there is no unbiased estimator for the natural parameters. [sent-307, score-1.296]

98 Initial investigation reveals that, for example, the ML estimator for a Bernoulli (logistic) conditional distribution is not consistent. [sent-308, score-0.7]

99 The problem of finding a consistent estimator for the linear subspace of natural parameters when yi |xi forms an exponential family remains open. [sent-309, score-0.814]

100 We also leave open the question of the efficiency of the various estimators, the problem of finding asymptotically efficient estimators, and that of finding consistent estimators exhibiting the finite-sample gains of the ML estimator for additive Gaussian mixture noise. [sent-310, score-0.996]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('estimator', 0.619), ('ml', 0.363), ('noise', 0.202), ('subspace', 0.176), ('yi', 0.157), ('uv', 0.133), ('signal', 0.132), ('additive', 0.132), ('wlra', 0.131), ('yij', 0.121), ('estimators', 0.115), ('laplace', 0.093), ('mixture', 0.092), ('const', 0.087), ('frobenius', 0.087), ('variances', 0.083), ('xi', 0.083), ('variance', 0.078), ('cij', 0.076), ('sin', 0.074), ('consistency', 0.073), ('norm', 0.071), ('relaxing', 0.069), ('sines', 0.066), ('canonical', 0.062), ('covariance', 0.061), ('eigenvectors', 0.061), ('gaussian', 0.06), ('matrix', 0.06), ('unbiased', 0.058), ('zij', 0.057), ('yj', 0.057), ('pr', 0.053), ('reconstruction', 0.051), ('rank', 0.051), ('angles', 0.051), ('ignored', 0.051), ('minimized', 0.051), ('collaborative', 0.048), ('rows', 0.048), ('conditional', 0.045), ('angle', 0.044), ('confront', 0.044), ('fro', 0.044), ('gsms', 0.044), ('maximumlikelihood', 0.044), ('minu', 0.044), ('restrictions', 0.043), ('var', 0.043), ('minimizer', 0.043), ('asymptotic', 0.043), ('distributions', 0.042), ('leading', 0.041), ('beyond', 0.04), ('gaussians', 0.04), ('coordinates', 0.039), ('dependencies', 0.039), ('identically', 0.038), ('consistent', 0.038), ('srebro', 0.038), ('gaussianity', 0.038), ('sine', 0.038), ('expectation', 0.037), ('relax', 0.037), ('subspaces', 0.037), ('decomposition', 0.037), ('singular', 0.036), ('distribution', 0.036), ('spanned', 0.035), ('unconstrained', 0.035), ('sk', 0.035), ('tommi', 0.035), ('nathan', 0.035), ('additivity', 0.035), ('impose', 0.034), ('distributed', 0.034), ('independent', 0.033), ('dependent', 0.032), ('anderson', 0.032), ('discussing', 0.032), ('succeeds', 0.032), ('ez', 0.032), ('seeking', 0.032), ('nuisance', 0.032), ('factorization', 0.032), ('isotropic', 0.032), ('zi', 0.031), ('guaranteed', 0.031), ('entries', 0.031), ('newton', 0.03), ('wij', 0.029), ('satisfactory', 0.029), ('unbounded', 0.029), ('xij', 0.028), ('cov', 0.028), ('mixtures', 0.027), ('variability', 0.027), ('presence', 0.027), ('captured', 0.027), ('ltering', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) non-Gaussian additive noise, and to unbiased non-additive models. 1

2 0.24903053 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

Author: Yoshua Bengio, Yves Grandvalet

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1

3 0.12488199 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations. A central issue in computational neuroscience is the characterization of the functional relationship between sensory stimuli and neural spike trains. A common model for this relationship consists of linear filtering of the stimulus, followed by a nonlinear, probabilistic spike generation process. The linear filter is typically interpreted as the neuron’s “receptive field,” while the spiking mechanism accounts for simple nonlinearities like rectification and response saturation. Given a set of stimuli and (extracellularly) recorded spike times, the characterization problem consists of estimating both the linear filter and the parameters governing the spiking mechanism. One widely used model of this type is the Linear-Nonlinear-Poisson (LNP) cascade model, in which spikes are generated according to an inhomogeneous Poisson process, with rate determined by an instantaneous (“memoryless”) nonlinear function of the filtered input. This model has a number of desirable features, including conceptual simplicity and computational tractability. Additionally, reverse correlation analysis provides a simple unbiased estimator for the linear filter [5], and the properties of estimators (for both the linear filter and static nonlinearity) have been thoroughly analyzed, even for the case of highly non-symmetric or “naturalistic” stimuli [12]. One important drawback of the LNP model, * JWP and LP contributed equally to this work. We thank E.J. Chichilnisky for helpful discussions. L−NLIF model LNP model )ekips(P Figure 1: Simulated responses of LNLIF and LNP models to 20 repetitions of a fixed 100-ms stimulus segment of temporal white noise. Top: Raster of responses of L-NLIF model, where σnoise /σsignal = 0.5 and g gives a membrane time constant of 15 ms. The top row shows the fixed (deterministic) response of the model with σnoise set to zero. Middle: Raster of responses of LNP model, with parameters fit with standard methods from a long run of the L-NLIF model responses to nonrepeating stimuli. Bottom: (Black line) Post-stimulus time histogram (PSTH) of the simulated L-NLIF response. (Gray line) PSTH of the LNP model. Note that the LNP model fails to preserve the fine temporal structure of the spike trains, relative to the L-NLIF model. 001 05 0 )sm( emit however, is that Poisson processes do not accurately capture the statistics of neural spike trains [2, 9, 16, 1]. 
In particular, the probability of observing a spike is not a functional of the stimulus only; it is also strongly affected by the recent history of spiking. The leaky integrate-and-fire (LIF) model provides a biophysically more realistic spike mechanism with a simple form of spike-history dependence. This model is simple, wellunderstood, and has dynamics that are entirely linear except for a nonlinear “reset” of the membrane potential following a spike. Although this model’s overriding linearity is often emphasized (due to the approximately linear relationship between input current and firing rate, and lack of active conductances), the nonlinear reset has significant functional importance for the model’s response properties. In previous work, we have shown that standard reverse correlation analysis fails when applied to a neuron with deterministic (noise-free) LIF spike generation; we developed a new estimator for this model, and demonstrated that a change in leakiness of such a mechanism might underlie nonlinear effects of contrast adaptation in macaque retinal ganglion cells [15]. We and others have explored other “adaptive” properties of the LIF model [17, 13, 19]. In this paper, we consider a model consisting of a linear filter followed by noisy LIF spike generation with a spike-dependent after-current; this is essentially the standard LIF model driven by a noisy, filtered version of the stimulus, with an additional current waveform injected following each spike. We will refer to this as the the “L-NLIF” model. The probabilistic nature of this model provides several important advantages over the deterministic version we have considered previously. First, an explicit noise model allows us to couch the problem in the terms of classical estimation theory. This, in turn, provides a natural “cost function” (likelihood) for model assessment and leads to more efficient estimation of the model parameters. Second, noise allows us to explicitly model neural firing statistics, and could provide a rigorous basis for a metric distance between spike trains, useful in other contexts [18]. Finally, noise influences the behavior of the model itself, giving rise to phenomena not observed in the purely deterministic model [11]. Our main contribution here is to show that the maximum likelihood estimator (MLE) for the L-NLIF model is computationally tractable. Specifically, we describe an algorithm for computing the likelihood function, and prove that this likelihood function contains no non-global maxima, implying that the MLE can be computed efficiently using standard ascent techniques. The desirable statistical properties of this estimator (e.g. consistency, efficiency) are all inherited “for free” from classical estimation theory. Thus, we have a compact and powerful model for the neural code, and a well-motivated, efficient way to estimate the parameters of this model from extracellular data. The Model We consider a model for which the (dimensionless) subthreshold voltage variable V evolves according to i−1 dV = − gV (t) + k · x(t) + j=0 h(t − tj ) dt + σNt , (1) and resets to Vr whenever V = 1. Here, g denotes the leak conductance, k · x(t) the projection of the input signal x(t) onto the linear kernel k, h is an “afterpotential,” a current waveform of fixed amplitude and shape whose value depends only on the time since the last spike ti−1 , and Nt is an unobserved (hidden) noise process with scale parameter σ. 
Without loss of generality, the “leak” and “threshold” potential are set at 0 and 1, respectively, so the cell spikes whenever V = 1, and V decays back to 0 with time constant 1/g in the absence of input. Note that the nonlinear behavior of the model is completely determined by only a few parameters, namely {g, σ, Vr }, and h (where the function h is allowed to take values in some low-dimensional vector space). The dynamical properties of this type of “spike response model” have been extensively studied [7]; for example, it is known that this class of models can effectively capture much of the behavior of apparently more biophysically realistic models (e.g. Hodgkin-Huxley). Figures 1 and 2 show several simple comparisons of the L-NLIF and LNP models. In 1, note the fine structure of spike timing in the responses of the L-NLIF model, which is qualitatively similar to in vivo experimental observations [2, 16, 9]). The LNP model fails to capture this fine temporal reproducibility. At the same time, the L-NLIF model is much more flexible and representationally powerful, as demonstrated in Fig. 2: by varying V r or h, for example, we can match a wide variety of dynamical behaviors (e.g. adaptation, bursting, bistability) known to exist in biological neurons. The Estimation Problem Our problem now is to estimate the model parameters {k, σ, g, Vr , h} from a sufficiently rich, dynamic input sequence x(t) together with spike times {ti }. A natural choice is the maximum likelihood estimator (MLE), which is easily proven to be consistent and statistically efficient here. To compute the MLE, we need to compute the likelihood and develop an algorithm for maximizing it. The tractability of the likelihood function for this model arises directly from the linearity of the subthreshold dynamics of voltage V (t) during an interspike interval. In the noiseless case [15], the voltage trace during an interspike interval t ∈ [ti−1 , ti ] is given by the solution to equation (1) with σ = 0:   V0 (t) = Vr e−gt + t ti−1 i−1 k · x(s) + j=0 h(s − tj ) e−g(t−s) ds, (2) A stimulus h current responses 0 0 0 1 )ces( t 0 2. 0 t stimulus x 0 B c responses c=1 h current 0 c=2 2. 0 c=5 1 )ces( t t 0 0 stimulus C 0 h current responses Figure 2: Illustration of diverse behaviors of L-NLIF model. A: Firing rate adaptation. A positive DC current (top) was injected into three model cells differing only in their h currents (shown on left: top, h = 0; middle, h depolarizing; bottom, h hyperpolarizing). Voltage traces of each cell’s response (right, with spikes superimposed) exhibit rate facilitation for depolarizing h (middle), and rate adaptation for hyperpolarizing h (bottom). B: Bursting. The response of a model cell with a biphasic h current (left) is shown as a function of the three different levels of DC current. For small current levels (top), the cell responds rhythmically. For larger currents (middle and bottom), the cell responds with regular bursts of spikes. C: Bistability. The stimulus (top) is a positive followed by a negative current pulse. Although a cell with no h current (middle) responds transiently to the positive pulse, a cell with biphasic h (bottom) exhibits a bistable response: the positive pulse puts it into a stable firing regime which persists until the arrival of a negative pulse. 0 0 1 )ces( t 0 5 0. t 0 which is simply a linear convolution of the input current with a negative exponential. 
It is easy to see that adding Gaussian noise to the voltage during each time step induces a Gaussian density over V (t), since linear dynamics preserve Gaussianity [8]. This density is uniquely characterized by its first two moments; the mean is given by (2), and its covariance T is σ 2 Eg Eg , where Eg is the convolution operator corresponding to e−gt . Note that this density is highly correlated for nearby points in time, since noise is integrated by the linear dynamics. Intuitively, smaller leak conductance g leads to stronger correlation in V (t) at nearby time points. We denote this Gaussian density G(xi , k, σ, g, Vr , h), where index i indicates the ith spike and the corresponding stimulus chunk xi (i.e. the stimuli that influence V (t) during the ith interspike interval). Now, on any interspike interval t ∈ [ti−1 , ti ], the only information we have is that V (t) is less than threshold for all times before ti , and exceeds threshold during the time bin containing ti . This translates to a set of linear constraints on V (t), expressed in terms of the set Ci = ti−1 ≤t < 1 ∩ V (ti ) ≥ 1 . Therefore, the likelihood that the neuron first spikes at time ti , given a spike at time ti−1 , is the probability of the event V (t) ∈ Ci , which is given by Lxi ,ti (k, σ, g, Vr , h) = G(xi , k, σ, g, Vr , h), Ci the integral of the Gaussian density G(xi , k, σ, g, Vr , h) over the set Ci . sulumits Figure 3: Behavior of the L-NLIF model during a single interspike interval, for a single (repeated) input current (top). Top middle: Ten simulated voltage traces V (t), evaluated up to the first threshold crossing, conditional on a spike at time zero (Vr = 0). Note the strong correlation between neighboring time points, and the sparsening of the plot as traces are eliminated by spiking. Bottom Middle: Time evolution of P (V ). Each column represents the conditional distribution of V at the corresponding time (i.e. for all traces that have not yet crossed threshold). Bottom: Probability density of the interspike interval (isi) corresponding to this particular input. Note that probability mass is concentrated at the points where input drives V0 (t) close to threshold. rhtV secart V 0 rhtV )V(P 0 )isi(P 002 001 )cesm( t 0 0 Spiking resets V to Vr , meaning that the noise contribution to V in different interspike intervals is independent. This “renewal” property, in turn, implies that the density over V (t) for an entire experiment factorizes into a product of conditionally independent terms, where each of these terms is one of the Gaussian integrals derived above for a single interspike interval. The likelihood for the entire spike train is therefore the product of these terms over all observed spikes. Putting all the pieces together, then, the full likelihood is L{xi ,ti } (k, σ, g, Vr , h) = G(xi , k, σ, g, Vr , h), i Ci where the product, again, is over all observed spike times {ti } and corresponding stimulus chunks {xi }. Now that we have an expression for the likelihood, we need to be able to maximize it. Our main result now states, basically, that we can use simple ascent algorithms to compute the MLE without getting stuck in local maxima. Theorem 1. The likelihood L{xi ,ti } (k, σ, g, Vr , h) has no non-global extrema in the parameters (k, σ, g, Vr , h), for any data {xi , ti }. The proof [14] is based on the log-concavity of L{xi ,ti } (k, σ, g, Vr , h) under a certain parametrization of (k, σ, g, Vr , h). 
The classical approach for establishing the nonexistence of non-global maxima of a given function uses concavity, which corresponds roughly to the function having everywhere non-positive second derivatives. However, the basic idea can be extended with the use of any invertible function: if f has no non-global extrema, neither will g(f ), for any strictly increasing real function g. The logarithm is a natural choice for g in any probabilistic context in which independence plays a role, since sums are easier to work with than products. Moreover, concavity of a function f is strictly stronger than logconcavity, so logconcavity can be a powerful tool even in situations for which concavity is useless (the Gaussian density is logconcave but not concave, for example). Our proof relies on a particular theorem [3] establishing the logconcavity of integrals of logconcave functions, and proceeds by making a correspondence between this type of integral and the integrals that appear in the definition of the L-NLIF likelihood above. We should also note that the proof extends without difficulty to some other noise processes which generate logconcave densities (where white noise has the standard Gaussian density); for example, the proof is nearly identical if Nt is allowed to be colored or nonGaussian noise, with possibly nonzero drift. Computational methods and numerical results Theorem 1 tells us that we can ascend the likelihood surface without fear of getting stuck in local maxima. Now how do we actually compute the likelihood? This is a nontrivial problem: we need to be able to quickly compute (or at least approximate, in a rational way) integrals of multivariate Gaussian densities G over simple but high-dimensional orthants Ci . We discuss two ways to compute these integrals; each has its own advantages. The first technique can be termed “density evolution” [10, 13]. The method is based on the following well-known fact from the theory of stochastic differential equations [8]: given the data (xi , ti−1 ), the probability density of the voltage process V (t) up to the next spike ti satisfies the following partial differential (Fokker-Planck) equation: ∂P (V, t) σ2 ∂ 2 P ∂[(V − Veq (t))P ] = , +g 2 ∂t 2 ∂V ∂V under the boundary conditions (3) P (V, ti−1 ) = δ(V − Vr ), P (Vth , t) = 0; where Veq (t) is the instantaneous equilibrium potential:   i−1 1 Veq (t) = h(t − tj ) . k · x(t) + g j=0 Moreover, the conditional firing rate f (t) satisfies t ti−1 f (s)ds = 1 − P (V, t)dV. Thus standard techniques for solving the drift-diffusion evolution equation (3) lead to a fast method for computing f (t) (as illustrated in Fig. 2). Finally, the likelihood Lxi ,ti (k, σ, g, Vr , h) is simply f (ti ). While elegant and efficient, this density evolution technique turns out to be slightly more powerful than what we need for the MLE: recall that we do not need to compute the conditional rate function f at all times t, but rather just at the set of spike times {ti }, and thus we can turn to more specialized techniques for faster performance. We employ a rapid technique for computing the likelihood using an algorithm due to Genz [6], designed to compute exactly the kinds of multidimensional Gaussian probability integrals considered here. This algorithm works well when the orthants Ci are defined by fewer than ≈ 10 linear constraints on V (t). 
The number of actual constraints on V (t) during an interspike interval (ti+1 − ti ) grows linearly in the length of the interval: thus, to use this algorithm in typical data situations, we adopt a strategy proposed in our work on the deterministic form of the model [15], in which we discard all but a small subset of the constraints. The key point is that, due to strong correlations in the noise and the fact that the constraints only figure significantly when the V (t) is driven close to threshold, a small number of constraints often suffice to approximate the true likelihood to a high degree of precision. h mitse h eurt K mitse ATS K eurt 0 0 06 )ekips retfa cesm( t 03 0 0 )ekips erofeb cesm( t 001- 002- Figure 4: Demonstration of the estimator’s performance on simulated data. Dashed lines show the true kernel k and aftercurrent h; k is a 12-sample function chosen to resemble the biphasic temporal impulse response of a macaque retinal ganglion cell, while h is function specified in a five-dimensional vector space, whose shape induces a slight degree of burstiness in the model’s spike responses. The L-NLIF model was stimulated with parameters g = 0.05 (corresponding to a membrane time constant of 20 time-samples), σ noise = 0.5, and Vr = 0. The stimulus was 30,000 time samples of white Gaussian noise with a standard deviation of 0.5. With only 600 spikes of output, the estimator is able to retrieve an estimate of k (gray curve) which closely matches the true kernel. Note that the spike-triggered average (black curve), which is an unbiased estimator for the kernel of an LNP neuron [5], differs significantly from this true kernel (see also [15]). The accuracy of this approach improves with the number of constraints considered, but performance is fastest with fewer constraints. Therefore, because ascending the likelihood function requires evaluating the likelihood at many different points, we can make this ascent process much quicker by applying a version of the coarse-to-fine idea. Let L k denote the approximation to the likelihood given by allowing only k constraints in the above algorithm. Then we know, by a proof identical to that of Theorem 1, that Lk has no local maxima; in addition, by the above logic, Lk → L as k grows. It takes little additional effort to prove that argmax Lk → argmax L; thus, we can efficiently ascend the true likelihood surface by ascending the “coarse” approximants Lk , then gradually “refining” our approximation by letting k increase. An application of this algorithm to simulated data is shown in Fig. 4. Further applications to both simulated and real data will be presented elsewhere. Discussion We have shown here that the L-NLIF model, which couples a linear filtering stage to a biophysically plausible and flexible model of neuronal spiking, can be efficiently estimated from extracellular physiological data using maximum likelihood. Moreover, this model lends itself directly to analysis via tools from the modern theory of point processes. For example, once we have obtained our estimate of the parameters (k, σ, g, Vr , h), how do we verify that the resulting model provides an adequate description of the data? This important “model validation” question has been the focus of some recent elegant research, under the rubric of “time rescaling” techniques [4]. While we lack the room here to review these methods in detail, we can note that they depend essentially on knowledge of the conditional firing rate function f (t). 
Recall that we showed how to efficiently compute this function in the last section and examined some of its qualitative properties in the L-NLIF context in Figs. 2 and 3. We are currently in the process of applying the model to physiological data recorded both in vivo and in vitro, in order to assess whether it accurately accounts for the stimulus preferences and spiking statistics of real neurons. One long-term goal of this research is to elucidate the different roles of stimulus-driven and stimulus-independent activity on the spiking patterns of both single cells and multineuronal ensembles. References [1] B. Aguera y Arcas and A. Fairhall. What causes a neuron to spike? 15:1789–1807, 2003. Neral Computation, [2] M. Berry and M. Meister. Refractoriness and neural precision. Journal of Neuroscience, 18:2200–2211, 1998. [3] V. Bogachev. Gaussian Measures. AMS, New York, 1998. [4] E. Brown, R. Barbieri, V. Ventura, R. Kass, and L. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14:325–346, 2002. [5] E. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12:199–213, 2001. [6] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992. [7] W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002. [8] S. Karlin and H. Taylor. A Second Course in Stochastic Processes. Academic Press, New York, 1981. [9] J. Keat, P. Reinagel, R. Reid, and M. Meister. Predicting every spike: a model for the responses of visual neurons. Neuron, 30:803–817, 2001. [10] B. Knight, A. Omurtag, and L. Sirovich. The approach of a neuron population firing rate to a new equilibrium: an exact theoretical result. Neural Computation, 12:1045–1055, 2000. [11] J. Levin and J. Miller. Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380:165–168, 1996. [12] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14:437–464, 2003. [13] L. Paninski, B. Lau, and A. Reyes. Noise-driven adaptation: in vitro and mathematical analysis. Neurocomputing, 52:877–883, 2003. [14] L. Paninski, J. Pillow, and E. Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. submitted manuscript (cns.nyu.edu/∼liam), 2004. [15] J. Pillow and E. Simoncelli. Biases in white noise analysis due to non-poisson spike generation. Neurocomputing, 52:109–115, 2003. [16] D. Reich, J. Victor, and B. Knight. The power ratio and the interval map: Spiking models and extracellular recordings. The Journal of Neuroscience, 18:10090–10104, 1998. [17] M. Rudd and L. Brown. Noise adaptation in integrate-and-fire neurons. Neural Computation, 9:1047–1069, 1997. [18] J. Victor. How the brain uses time to represent and process visual information. Brain Research, 886:33–46, 2000. [19] Y. Yu and T. Lee. Dynamical mechanisms underlying contrast gain control in sing le neurons. Physical Review E, 68:011901, 2003.

4 0.10751586 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

5 0.10650753 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

6 0.096717246 128 nips-2003-Minimax Embeddings

7 0.09548606 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

8 0.09234646 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

9 0.089639157 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence

10 0.089405537 117 nips-2003-Linear Response for Approximate Inference

11 0.087950639 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

12 0.086727649 122 nips-2003-Margin Maximizing Loss Functions

13 0.084540725 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

14 0.082868129 100 nips-2003-Laplace Propagation

15 0.082451783 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

16 0.081852391 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

17 0.078255937 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

18 0.076211572 66 nips-2003-Extreme Components Analysis

19 0.071811698 92 nips-2003-Information Bottleneck for Gaussian Variables

20 0.068527006 79 nips-2003-Gene Expression Clustering with Functional Mixture Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.235), (1, -0.089), (2, -0.017), (3, 0.02), (4, 0.081), (5, 0.112), (6, 0.089), (7, -0.137), (8, -0.021), (9, -0.083), (10, -0.04), (11, 0.119), (12, -0.001), (13, 0.006), (14, -0.139), (15, 0.043), (16, 0.119), (17, -0.228), (18, 0.117), (19, 0.094), (20, 0.026), (21, 0.146), (22, -0.039), (23, 0.021), (24, 0.003), (25, -0.007), (26, 0.073), (27, -0.05), (28, 0.029), (29, -0.028), (30, 0.023), (31, -0.076), (32, -0.212), (33, 0.053), (34, -0.104), (35, -0.146), (36, -0.066), (37, 0.027), (38, 0.118), (39, 0.166), (40, -0.003), (41, 0.09), (42, -0.049), (43, 0.157), (44, 0.096), (45, -0.043), (46, 0.035), (47, -0.123), (48, 0.124), (49, -0.11)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97104108 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) non-Gaussian additive noise, and to unbiased non-additive models. 1

2 0.88873833 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

Author: Yoshua Bengio, Yves Grandvalet

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1

3 0.51791632 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations. A central issue in computational neuroscience is the characterization of the functional relationship between sensory stimuli and neural spike trains. A common model for this relationship consists of linear filtering of the stimulus, followed by a nonlinear, probabilistic spike generation process. The linear filter is typically interpreted as the neuron’s “receptive field,” while the spiking mechanism accounts for simple nonlinearities like rectification and response saturation. Given a set of stimuli and (extracellularly) recorded spike times, the characterization problem consists of estimating both the linear filter and the parameters governing the spiking mechanism. One widely used model of this type is the Linear-Nonlinear-Poisson (LNP) cascade model, in which spikes are generated according to an inhomogeneous Poisson process, with rate determined by an instantaneous (“memoryless”) nonlinear function of the filtered input. This model has a number of desirable features, including conceptual simplicity and computational tractability. Additionally, reverse correlation analysis provides a simple unbiased estimator for the linear filter [5], and the properties of estimators (for both the linear filter and static nonlinearity) have been thoroughly analyzed, even for the case of highly non-symmetric or “naturalistic” stimuli [12]. One important drawback of the LNP model, * JWP and LP contributed equally to this work. We thank E.J. Chichilnisky for helpful discussions. L−NLIF model LNP model )ekips(P Figure 1: Simulated responses of LNLIF and LNP models to 20 repetitions of a fixed 100-ms stimulus segment of temporal white noise. Top: Raster of responses of L-NLIF model, where σnoise /σsignal = 0.5 and g gives a membrane time constant of 15 ms. The top row shows the fixed (deterministic) response of the model with σnoise set to zero. Middle: Raster of responses of LNP model, with parameters fit with standard methods from a long run of the L-NLIF model responses to nonrepeating stimuli. Bottom: (Black line) Post-stimulus time histogram (PSTH) of the simulated L-NLIF response. (Gray line) PSTH of the LNP model. Note that the LNP model fails to preserve the fine temporal structure of the spike trains, relative to the L-NLIF model. 001 05 0 )sm( emit however, is that Poisson processes do not accurately capture the statistics of neural spike trains [2, 9, 16, 1]. 
In particular, the probability of observing a spike is not a functional of the stimulus only; it is also strongly affected by the recent history of spiking. The leaky integrate-and-fire (LIF) model provides a biophysically more realistic spike mechanism with a simple form of spike-history dependence. This model is simple, wellunderstood, and has dynamics that are entirely linear except for a nonlinear “reset” of the membrane potential following a spike. Although this model’s overriding linearity is often emphasized (due to the approximately linear relationship between input current and firing rate, and lack of active conductances), the nonlinear reset has significant functional importance for the model’s response properties. In previous work, we have shown that standard reverse correlation analysis fails when applied to a neuron with deterministic (noise-free) LIF spike generation; we developed a new estimator for this model, and demonstrated that a change in leakiness of such a mechanism might underlie nonlinear effects of contrast adaptation in macaque retinal ganglion cells [15]. We and others have explored other “adaptive” properties of the LIF model [17, 13, 19]. In this paper, we consider a model consisting of a linear filter followed by noisy LIF spike generation with a spike-dependent after-current; this is essentially the standard LIF model driven by a noisy, filtered version of the stimulus, with an additional current waveform injected following each spike. We will refer to this as the the “L-NLIF” model. The probabilistic nature of this model provides several important advantages over the deterministic version we have considered previously. First, an explicit noise model allows us to couch the problem in the terms of classical estimation theory. This, in turn, provides a natural “cost function” (likelihood) for model assessment and leads to more efficient estimation of the model parameters. Second, noise allows us to explicitly model neural firing statistics, and could provide a rigorous basis for a metric distance between spike trains, useful in other contexts [18]. Finally, noise influences the behavior of the model itself, giving rise to phenomena not observed in the purely deterministic model [11]. Our main contribution here is to show that the maximum likelihood estimator (MLE) for the L-NLIF model is computationally tractable. Specifically, we describe an algorithm for computing the likelihood function, and prove that this likelihood function contains no non-global maxima, implying that the MLE can be computed efficiently using standard ascent techniques. The desirable statistical properties of this estimator (e.g. consistency, efficiency) are all inherited “for free” from classical estimation theory. Thus, we have a compact and powerful model for the neural code, and a well-motivated, efficient way to estimate the parameters of this model from extracellular data. The Model We consider a model for which the (dimensionless) subthreshold voltage variable V evolves according to i−1 dV = − gV (t) + k · x(t) + j=0 h(t − tj ) dt + σNt , (1) and resets to Vr whenever V = 1. Here, g denotes the leak conductance, k · x(t) the projection of the input signal x(t) onto the linear kernel k, h is an “afterpotential,” a current waveform of fixed amplitude and shape whose value depends only on the time since the last spike ti−1 , and Nt is an unobserved (hidden) noise process with scale parameter σ. 
Without loss of generality, the “leak” and “threshold” potential are set at 0 and 1, respectively, so the cell spikes whenever V = 1, and V decays back to 0 with time constant 1/g in the absence of input. Note that the nonlinear behavior of the model is completely determined by only a few parameters, namely {g, σ, Vr}, and h (where the function h is allowed to take values in some low-dimensional vector space). The dynamical properties of this type of “spike response model” have been extensively studied [7]; for example, it is known that this class of models can effectively capture much of the behavior of apparently more biophysically realistic models (e.g. Hodgkin-Huxley).

Figures 1 and 2 show several simple comparisons of the L-NLIF and LNP models. In Fig. 1, note the fine structure of spike timing in the responses of the L-NLIF model, which is qualitatively similar to in vivo experimental observations [2, 16, 9]. The LNP model fails to capture this fine temporal reproducibility. At the same time, the L-NLIF model is much more flexible and representationally powerful, as demonstrated in Fig. 2: by varying Vr or h, for example, we can match a wide variety of dynamical behaviors (e.g. adaptation, bursting, bistability) known to exist in biological neurons.

Figure 2: Illustration of diverse behaviors of the L-NLIF model. A: Firing rate adaptation. A positive DC current (top) was injected into three model cells differing only in their h currents (shown on left: top, h = 0; middle, h depolarizing; bottom, h hyperpolarizing). Voltage traces of each cell’s response (right, with spikes superimposed) exhibit rate facilitation for depolarizing h (middle), and rate adaptation for hyperpolarizing h (bottom). B: Bursting. The response of a model cell with a biphasic h current (left) is shown as a function of the three different levels of DC current. For small current levels (top), the cell responds rhythmically. For larger currents (middle and bottom), the cell responds with regular bursts of spikes. C: Bistability. The stimulus (top) is a positive followed by a negative current pulse. Although a cell with no h current (middle) responds transiently to the positive pulse, a cell with biphasic h (bottom) exhibits a bistable response: the positive pulse puts it into a stable firing regime which persists until the arrival of a negative pulse. (Each panel shows the stimulus, the h current, and the voltage responses; time axes in seconds.)

The Estimation Problem. Our problem now is to estimate the model parameters {k, σ, g, Vr, h} from a sufficiently rich, dynamic input sequence x(t) together with spike times {t_i}. A natural choice is the maximum likelihood estimator (MLE), which is easily proven to be consistent and statistically efficient here. To compute the MLE, we need to compute the likelihood and develop an algorithm for maximizing it. The tractability of the likelihood function for this model arises directly from the linearity of the subthreshold dynamics of voltage V(t) during an interspike interval. In the noiseless case [15], the voltage trace during an interspike interval t ∈ [t_{i−1}, t_i] is given by the solution to equation (1) with σ = 0:

V_0(t) = Vr e^{−gt} + ∫_{t_{i−1}}^{t} [ k·x(s) + Σ_{j=0}^{i−1} h(s − t_j) ] e^{−g(t−s)} ds,   (2)

which is simply a linear convolution of the input current with a negative exponential.
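For concreteness, here is a small sketch of how the noiseless solution (2) can be evaluated on a time grid: the reset decays with time constant 1/g, and the input current is convolved with a decaying exponential. The helper name, grid spacing, and toy input are assumptions for illustration only.

```python
import numpy as np

def noiseless_voltage(drive_segment, g, Vr, dt):
    """Discrete approximation to eq. (2): the noiseless interspike voltage
    V0(t) is the decaying reset plus the input current convolved with
    exp(-g t).  drive_segment holds k.x(s) + sum_j h(s - t_j) sampled
    on the interval since the last spike."""
    n = len(drive_segment)
    t = np.arange(n) * dt
    eg = np.exp(-g * t) * dt                              # kernel of the operator E_g
    conv = np.convolve(drive_segment, eg, mode="full")[:n]
    return Vr * np.exp(-g * t) + conv

# Toy usage: constant depolarizing current after a spike at time 0.
V0 = noiseless_voltage(np.full(200, 0.06), g=0.05, Vr=0.0, dt=1.0)
print("max noiseless voltage:", V0.max())
```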
It is easy to see that adding Gaussian noise to the voltage during each time step induces a Gaussian density over V(t), since linear dynamics preserve Gaussianity [8]. This density is uniquely characterized by its first two moments; the mean is given by (2), and its covariance is σ² E_g E_g^T, where E_g is the convolution operator corresponding to e^{−gt}. Note that this density is highly correlated for nearby points in time, since noise is integrated by the linear dynamics. Intuitively, smaller leak conductance g leads to stronger correlation in V(t) at nearby time points. We denote this Gaussian density G(x_i, k, σ, g, Vr, h), where index i indicates the ith spike and the corresponding stimulus chunk x_i (i.e. the stimuli that influence V(t) during the ith interspike interval).

Now, on any interspike interval t ∈ [t_{i−1}, t_i], the only information we have is that V(t) is less than threshold for all times before t_i, and exceeds threshold during the time bin containing t_i. This translates to a set of linear constraints on V(t), expressed in terms of the set

C_i = { V(t) < 1, ∀ t ∈ [t_{i−1}, t_i) } ∩ { V(t_i) ≥ 1 }.

Therefore, the likelihood that the neuron first spikes at time t_i, given a spike at time t_{i−1}, is the probability of the event V(t) ∈ C_i, which is given by

L_{x_i,t_i}(k, σ, g, Vr, h) = ∫_{C_i} G(x_i, k, σ, g, Vr, h),

the integral of the Gaussian density G(x_i, k, σ, g, Vr, h) over the set C_i.

Figure 3: Behavior of the L-NLIF model during a single interspike interval, for a single (repeated) input current (top). Top middle: Ten simulated voltage traces V(t), evaluated up to the first threshold crossing, conditional on a spike at time zero (Vr = 0). Note the strong correlation between neighboring time points, and the sparsening of the plot as traces are eliminated by spiking. Bottom Middle: Time evolution of P(V). Each column represents the conditional distribution of V at the corresponding time (i.e. for all traces that have not yet crossed threshold). Bottom: Probability density of the interspike interval (isi) corresponding to this particular input. Note that probability mass is concentrated at the points where input drives V_0(t) close to threshold. (Axis labels: stimulus; voltage traces with threshold V_thr; P(V); P(isi); t from 0 to 200 msec.)

Spiking resets V to Vr, meaning that the noise contribution to V in different interspike intervals is independent. This “renewal” property, in turn, implies that the density over V(t) for an entire experiment factorizes into a product of conditionally independent terms, where each of these terms is one of the Gaussian integrals derived above for a single interspike interval. The likelihood for the entire spike train is therefore the product of these terms over all observed spikes. Putting all the pieces together, then, the full likelihood is

L_{{x_i,t_i}}(k, σ, g, Vr, h) = ∏_i ∫_{C_i} G(x_i, k, σ, g, Vr, h),

where the product, again, is over all observed spike times {t_i} and corresponding stimulus chunks {x_i}.
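The paper evaluates these Gaussian integrals exactly (via the Genz algorithm discussed below); purely as an illustration of what the single-interval integral over C_i measures, the following sketch estimates it by Monte Carlo, sampling correlated Gaussian voltage paths around a toy noiseless trajectory and counting those that stay subthreshold until the final bin and cross there. The discretization, the toy trajectory, and the sample count are all assumptions.

```python
import numpy as np

def interval_likelihood_mc(V0, g, sigma, dt, n_samples=20000, seed=0):
    """Monte Carlo estimate of the single-interval likelihood: the probability
    that V(t) = V0(t) + noise stays below threshold (= 1) in every bin before
    the spike bin and reaches threshold in the spike bin.  The noise has
    covariance sigma^2 Eg Eg^T, with Eg the discretized, lower-triangular
    convolution operator for exp(-g t).  This is only a sketch; the paper
    computes the integral exactly with Genz's algorithm."""
    rng = np.random.default_rng(seed)
    n = len(V0)
    t = np.arange(n) * dt
    Eg = np.tril(np.exp(-g * (t[:, None] - t[None, :]))) * dt        # E_g as a matrix
    noise = rng.standard_normal((n_samples, n)) @ (sigma * Eg.T)      # rows ~ N(0, sigma^2 Eg Eg^T)
    V = V0[None, :] + noise
    subthreshold = np.all(V[:, :-1] < 1.0, axis=1)                    # below threshold before the spike
    crossed = V[:, -1] >= 1.0                                         # crosses in the spike bin
    return np.mean(subthreshold & crossed)

# Toy noiseless trajectory that ramps up to threshold at the end of the interval.
V0 = np.linspace(0.0, 1.0, 80)
print("single-interval likelihood ~", interval_likelihood_mc(V0, g=0.05, sigma=0.1, dt=1.0))
```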
Now that we have an expression for the likelihood, we need to be able to maximize it. Our main result now states, basically, that we can use simple ascent algorithms to compute the MLE without getting stuck in local maxima.

Theorem 1. The likelihood L_{{x_i,t_i}}(k, σ, g, Vr, h) has no non-global extrema in the parameters (k, σ, g, Vr, h), for any data {x_i, t_i}.

The proof [14] is based on the log-concavity of L_{{x_i,t_i}}(k, σ, g, Vr, h) under a certain parametrization of (k, σ, g, Vr, h). The classical approach for establishing the nonexistence of non-global maxima of a given function uses concavity, which corresponds roughly to the function having everywhere non-positive second derivatives. However, the basic idea can be extended with the use of any invertible function: if f has no non-global extrema, neither will g(f), for any strictly increasing real function g. The logarithm is a natural choice for g in any probabilistic context in which independence plays a role, since sums are easier to work with than products. Moreover, concavity of a function f is strictly stronger than logconcavity, so logconcavity can be a powerful tool even in situations for which concavity is useless (the Gaussian density is logconcave but not concave, for example). Our proof relies on a particular theorem [3] establishing the logconcavity of integrals of logconcave functions, and proceeds by making a correspondence between this type of integral and the integrals that appear in the definition of the L-NLIF likelihood above. We should also note that the proof extends without difficulty to some other noise processes which generate logconcave densities (where white noise has the standard Gaussian density); for example, the proof is nearly identical if N_t is allowed to be colored or nonGaussian noise, with possibly nonzero drift.

Computational methods and numerical results. Theorem 1 tells us that we can ascend the likelihood surface without fear of getting stuck in local maxima. Now how do we actually compute the likelihood? This is a nontrivial problem: we need to be able to quickly compute (or at least approximate, in a rational way) integrals of multivariate Gaussian densities G over simple but high-dimensional orthants C_i. We discuss two ways to compute these integrals; each has its own advantages.

The first technique can be termed “density evolution” [10, 13]. The method is based on the following well-known fact from the theory of stochastic differential equations [8]: given the data (x_i, t_{i−1}), the probability density of the voltage process V(t) up to the next spike t_i satisfies the following partial differential (Fokker-Planck) equation:

∂P(V, t)/∂t = (σ²/2) ∂²P/∂V² + g ∂[(V − V_eq(t)) P]/∂V,   (3)

under the boundary conditions P(V, t_{i−1}) = δ(V − Vr), P(V_th, t) = 0; where V_eq(t) is the instantaneous equilibrium potential:

V_eq(t) = (1/g) [ k·x(t) + Σ_{j=0}^{i−1} h(t − t_j) ].

Moreover, the conditional firing rate f(t) satisfies

∫_{t_{i−1}}^{t} f(s) ds = 1 − ∫ P(V, t) dV.

Thus standard techniques for solving the drift-diffusion evolution equation (3) lead to a fast method for computing f(t) (as illustrated in Fig. 2). Finally, the likelihood L_{x_i,t_i}(k, σ, g, Vr, h) is simply f(t_i).
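As a rough illustration of the density-evolution idea, the sketch below integrates a discretized version of equation (3) with an explicit finite-difference scheme, an absorbing boundary at threshold, and a narrow Gaussian standing in for the δ initial condition; the conditional firing rate is read off as the rate of probability-mass loss. Grid sizes, step sizes, and the constant equilibrium potential are illustrative assumptions, and this crude explicit scheme is not the solver used in the paper.

```python
import numpy as np

def density_evolution(Veq, g, sigma, Vr, dt, V_grid):
    """Explicit finite-difference integration of the Fokker-Planck equation (3):
    dP/dt = (sigma^2/2) d^2P/dV^2 + g d[(V - Veq(t)) P]/dV,
    with an absorbing boundary P(V_th, t) = 0 at V_th = 1.  Returns the
    conditional firing rate f(t), approximated as the probability mass
    lost per unit time."""
    dV = V_grid[1] - V_grid[0]
    # Narrow Gaussian standing in for the delta initial condition at Vr.
    P = np.exp(-(V_grid - Vr) ** 2 / (2 * (2 * dV) ** 2))
    P /= P.sum() * dV
    f = np.zeros(len(Veq))
    mass_prev = 1.0
    for n in range(len(Veq)):
        drift = g * (V_grid - Veq[n]) * P
        ddrift = np.zeros_like(P)
        ddrift[1:-1] = (drift[2:] - drift[:-2]) / (2 * dV)         # d[(V - Veq)P]/dV
        lap = np.zeros_like(P)
        lap[1:-1] = (P[2:] - 2 * P[1:-1] + P[:-2]) / dV ** 2       # d^2P/dV^2
        P = np.clip(P + dt * (0.5 * sigma ** 2 * lap + ddrift), 0.0, None)
        P[-1] = 0.0                                                 # absorb at threshold V = 1
        mass = P.sum() * dV
        f[n] = max(mass_prev - mass, 0.0) / dt
        mass_prev = mass
    return f

V_grid = np.linspace(-1.0, 1.0, 101)
Veq = np.full(2000, 1.1)        # constant suprathreshold equilibrium potential (toy input)
f = density_evolution(Veq, g=0.05, sigma=0.1, Vr=0.0, dt=0.02, V_grid=V_grid)
print("probability of spiking within the interval:", f.sum() * 0.02)
```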
While elegant and efficient, this density evolution technique turns out to be slightly more powerful than what we need for the MLE: recall that we do not need to compute the conditional rate function f at all times t, but rather just at the set of spike times {t_i}, and thus we can turn to more specialized techniques for faster performance. We employ a rapid technique for computing the likelihood using an algorithm due to Genz [6], designed to compute exactly the kinds of multidimensional Gaussian probability integrals considered here. This algorithm works well when the orthants C_i are defined by fewer than ≈ 10 linear constraints on V(t). The number of actual constraints on V(t) during an interspike interval (t_{i+1} − t_i) grows linearly in the length of the interval: thus, to use this algorithm in typical data situations, we adopt a strategy proposed in our work on the deterministic form of the model [15], in which we discard all but a small subset of the constraints. The key point is that, due to strong correlations in the noise and the fact that the constraints only figure significantly when V(t) is driven close to threshold, a small number of constraints often suffice to approximate the true likelihood to a high degree of precision.

Figure 4: Demonstration of the estimator’s performance on simulated data. Dashed lines show the true kernel k and aftercurrent h; k is a 12-sample function chosen to resemble the biphasic temporal impulse response of a macaque retinal ganglion cell, while h is a function specified in a five-dimensional vector space, whose shape induces a slight degree of burstiness in the model’s spike responses. The L-NLIF model was simulated with parameters g = 0.05 (corresponding to a membrane time constant of 20 time-samples), σ_noise = 0.5, and Vr = 0. The stimulus was 30,000 time samples of white Gaussian noise with a standard deviation of 0.5. With only 600 spikes of output, the estimator is able to retrieve an estimate of k (gray curve) which closely matches the true kernel. Note that the spike-triggered average (black curve), which is an unbiased estimator for the kernel of an LNP neuron [5], differs significantly from this true kernel (see also [15]). (Legend: true K, STA, estimated K over t from −200 to 0 msec before the spike; true h and estimated h over t from 0 to 60 msec after the spike.)

The accuracy of this approach improves with the number of constraints considered, but performance is fastest with fewer constraints. Therefore, because ascending the likelihood function requires evaluating the likelihood at many different points, we can make this ascent process much quicker by applying a version of the coarse-to-fine idea. Let L_k denote the approximation to the likelihood given by allowing only k constraints in the above algorithm. Then we know, by a proof identical to that of Theorem 1, that L_k has no local maxima; in addition, by the above logic, L_k → L as k grows. It takes little additional effort to prove that argmax L_k → argmax L; thus, we can efficiently ascend the true likelihood surface by ascending the “coarse” approximants L_k, then gradually “refining” our approximation by letting k increase. An application of this algorithm to simulated data is shown in Fig. 4. Further applications to both simulated and real data will be presented elsewhere.
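The constraint-selection and coarse-to-fine ideas can be sketched as follows: keep the spike bin plus the few earlier bins where the noiseless voltage comes closest to threshold, and ascend successively finer approximations L_k with a standard optimizer. The function names, the Nelder-Mead choice, and the assumption that a per-k negative log-likelihood routine is available are all illustrative; this is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def select_constraints(V0, n_keep):
    """Coarse-to-fine constraint selection: keep the spike bin itself plus the
    n_keep - 1 earlier bins where the noiseless voltage V0 comes closest to
    threshold, since those dominate the value of the Gaussian integral."""
    last = len(V0) - 1
    earlier = np.argsort(-V0[:last])[:n_keep - 1]     # bins nearest threshold
    return np.sort(np.append(earlier, last))

def coarse_to_fine_mle(neg_log_lik, theta0, schedule=(2, 5, 10)):
    """Ascend approximations L_k of the likelihood for an increasing number of
    constraints k.  neg_log_lik(theta, k) is assumed (hypothetically) to return
    the negative log of the k-constraint approximate likelihood; Theorem 1
    rules out non-global maxima, so plain ascent suffices here."""
    theta = np.asarray(theta0, dtype=float)
    for k in schedule:
        theta = minimize(neg_log_lik, theta, args=(k,), method="Nelder-Mead").x
    return theta

# Toy check of the constraint selector on a ramp that crosses threshold at the end.
print(select_constraints(np.linspace(0.0, 1.0, 50), n_keep=5))
```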
Discussion. We have shown here that the L-NLIF model, which couples a linear filtering stage to a biophysically plausible and flexible model of neuronal spiking, can be efficiently estimated from extracellular physiological data using maximum likelihood. Moreover, this model lends itself directly to analysis via tools from the modern theory of point processes. For example, once we have obtained our estimate of the parameters (k, σ, g, Vr, h), how do we verify that the resulting model provides an adequate description of the data? This important “model validation” question has been the focus of some recent elegant research, under the rubric of “time rescaling” techniques [4]. While we lack the room here to review these methods in detail, we can note that they depend essentially on knowledge of the conditional firing rate function f(t). Recall that we showed how to efficiently compute this function in the last section and examined some of its qualitative properties in the L-NLIF context in Figs. 2 and 3. We are currently in the process of applying the model to physiological data recorded both in vivo and in vitro, in order to assess whether it accurately accounts for the stimulus preferences and spiking statistics of real neurons. One long-term goal of this research is to elucidate the different roles of stimulus-driven and stimulus-independent activity on the spiking patterns of both single cells and multineuronal ensembles.

References

[1] B. Aguera y Arcas and A. Fairhall. What causes a neuron to spike? Neural Computation, 15:1789–1807, 2003.
[2] M. Berry and M. Meister. Refractoriness and neural precision. Journal of Neuroscience, 18:2200–2211, 1998.
[3] V. Bogachev. Gaussian Measures. AMS, New York, 1998.
[4] E. Brown, R. Barbieri, V. Ventura, R. Kass, and L. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14:325–346, 2002.
[5] E. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12:199–213, 2001.
[6] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992.
[7] W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[8] S. Karlin and H. Taylor. A Second Course in Stochastic Processes. Academic Press, New York, 1981.
[9] J. Keat, P. Reinagel, R. Reid, and M. Meister. Predicting every spike: a model for the responses of visual neurons. Neuron, 30:803–817, 2001.
[10] B. Knight, A. Omurtag, and L. Sirovich. The approach of a neuron population firing rate to a new equilibrium: an exact theoretical result. Neural Computation, 12:1045–1055, 2000.
[11] J. Levin and J. Miller. Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380:165–168, 1996.
[12] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14:437–464, 2003.
[13] L. Paninski, B. Lau, and A. Reyes. Noise-driven adaptation: in vitro and mathematical analysis. Neurocomputing, 52:877–883, 2003.
[14] L. Paninski, J. Pillow, and E. Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Submitted manuscript (cns.nyu.edu/∼liam), 2004.
[15] J. Pillow and E. Simoncelli. Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52:109–115, 2003.
[16] D. Reich, J. Victor, and B. Knight. The power ratio and the interval map: spiking models and extracellular recordings. The Journal of Neuroscience, 18:10090–10104, 1998.
[17] M. Rudd and L. Brown. Noise adaptation in integrate-and-fire neurons. Neural Computation, 9:1047–1069, 1997.
[18] J. Victor. How the brain uses time to represent and process visual information. Brain Research, 886:33–46, 2000.
[19] Y. Yu and T. Lee. Dynamical mechanisms underlying contrast gain control in single neurons. Physical Review E, 68:011901, 2003.

4 0.49108058 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

5 0.40909803 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan

Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1

6 0.38454092 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

7 0.38432711 128 nips-2003-Minimax Embeddings

8 0.38327992 66 nips-2003-Extreme Components Analysis

9 0.3683444 51 nips-2003-Design of Experiments via Information Theory

10 0.3652449 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence

11 0.36189258 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

12 0.3384237 124 nips-2003-Max-Margin Markov Networks

13 0.3235178 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

14 0.31549323 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

15 0.3151893 55 nips-2003-Distributed Optimization in Adaptive Networks

16 0.30905962 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel

17 0.30564266 102 nips-2003-Large Scale Online Learning

18 0.29966432 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

19 0.29158765 48 nips-2003-Convex Methods for Transduction

20 0.29119846 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.033), (11, 0.014), (29, 0.02), (30, 0.032), (35, 0.061), (46, 0.222), (53, 0.135), (66, 0.035), (69, 0.03), (71, 0.071), (76, 0.055), (85, 0.082), (91, 0.11), (99, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87719357 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) nonGaussian additive noise, and to unbiased non-additive models. 1

2 0.69498396 107 nips-2003-Learning Spectral Clustering

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1

3 0.69216835 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

Author: Kenji Fukumizu, Francis R. Bach, Michael I. Jordan

Abstract: We propose a novel method of dimensionality reduction for supervised learning. Given a regression or classification problem in which we wish to predict a variable Y from an explanatory vector X, we treat the problem of dimensionality reduction as that of finding a low-dimensional “effective subspace” of X which retains the statistical relationship between X and Y . We show that this problem can be formulated in terms of conditional independence. To turn this formulation into an optimization problem, we characterize the notion of conditional independence using covariance operators on reproducing kernel Hilbert spaces; this allows us to derive a contrast function for estimation of the effective subspace. Unlike many conventional methods, the proposed method requires neither assumptions on the marginal distribution of X, nor a parametric model of the conditional distribution of Y . 1

4 0.69176346 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1

5 0.68638414 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

6 0.68570483 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

7 0.68525058 113 nips-2003-Learning with Local and Global Consistency

8 0.68461835 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

9 0.68227392 30 nips-2003-Approximability of Probability Distributions

10 0.68087989 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

11 0.68074 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

12 0.67999095 143 nips-2003-On the Dynamics of Boosting

13 0.67938805 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

14 0.67932916 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

15 0.67814243 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

16 0.67797202 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

17 0.67759454 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

18 0.67719573 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

19 0.67682195 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

20 0.67559284 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects