nips nips2003 nips2003-137 knowledge-graph by maker-knowledge-mining

137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation


Source: pdf

Author: Yoshua Bengio, Yves Grandvalet

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. [sent-6, score-0.083]

2 In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. [sent-7, score-0.058]

3 The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. [sent-9, score-0.723]

4 An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. [sent-10, score-0.34]

5 Learning algorithms themselves are often compared on their average performance, which estimates the expected value of the prediction error (EPE) over training sets. [sent-14, score-0.215]

6 If the amount of data is large enough, PE can be estimated by the mean error over a hold-out test set. [sent-15, score-0.089]

7 The hold-out technique does not account for the variance with respect to the training set, and may thus be considered inappropriate for the purpose of algorithm comparison [4]. [sent-16, score-0.314]

8 In this situation, one resorts to computer intensive resampling methods such as cross-validation or bootstrap to estimate PE or EPE. [sent-18, score-0.107]

9 While it is known that cross-validation provides an unbiased estimate of EPE, it is also known that its variance may be very large [2]. [sent-20, score-0.496]

10 This variance should be estimated to provide faithful confidence intervals on PE or EPE, and to test the significance of observed differences between algorithms. [sent-21, score-0.281]

11 The difficulties of variance estimation have already been addressed [4, 7, 8]. [sent-23, score-0.243]

12 This paper builds upon the work of Nadeau and Bengio [8], which investigated in detail the theoretical and practical merits of several estimators of the variance of cross-validation. [sent-25, score-0.369]

13 While [8] considers K independent training and test splits, we focus on the standard K-fold cross-validation procedure, with no overlap between test sets: each example is used once and only once as a test example. [sent-27, score-0.305]

14 2 General Framework Formally, we have a training set D = {z1 , . . . , zn }, [sent-28, score-0.074]

15 with zi ∈ Z, assumed independently sampled from an unknown distribution P . [sent-31, score-0.249]

16 A is insensitive to the ordering of examples in the training set D. [sent-35, score-0.111]

17 Let f = A(D) be the function returned by algorithm A on the training set D. [sent-38, score-0.074]

18 In application-based evaluation, the goal of learning is usually stated as the minimization of the expected loss of f = A(D) on future test examples: PE(D) = E[L(f, z)] , (1) where the expectation is taken with respect to z ∼ P . [sent-39, score-0.164]

19 To evaluate and compare learning algorithms [4] we care about the expected performance of learning algorithm A over different training sets: EPE(n) = E[L(A(D), z)] , (2) where the expectation is taken with respect to D × z independently sampled from P n × P . [sent-40, score-0.171]

20 Although this point is often overlooked, estimating the variance of the estimates PE and EPE requires caution, as illustrated here. [sent-42, score-0.306]

21 1 Hold-out estimates of performance The mean error over a hold-out test set estimates PE, and the variance of PE is given by the usual variance estimate for means of independent variables. [sent-44, score-0.736]
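
As a concrete illustration (a minimal sketch, not taken from the paper; the function name is an assumption of ours), the hold-out estimate of PE and the usual variance-of-the-mean estimate can be written as:

```python
import numpy as np

def holdout_pe_and_variance(losses):
    """Hold-out estimate of PE and the usual variance estimate for the mean
    of independent per-example losses (appropriate for PE, not for EPE)."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    pe_hat = losses.mean()
    var_hat = losses.var(ddof=1) / n  # sample variance of the mean
    return pe_hat, var_hat
```

As the next sentence stresses, this variance estimate treats the test errors as independent, which no longer holds once the training set itself is treated as random.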

22 However, this variance estimator is not suited to EPE: the test errors are correlated when the training set is considered as a random variable. [sent-45, score-0.656]

23 The average ratio (estimator of variance/empirical variance) is displayed for two variance estimators, in an ideal situation where 10 independent training and test sets are available. [sent-47, score-0.506]

24 The average of θ1 /θ, the naive variance estimator ignoring correlations, shows that this estimate is highly down-biased, even for large sample sizes. [sent-48, score-0.631]

25 The error bars represent ±2 standard errors on the average value. [sent-54, score-0.121]

26 The sets D1 , . . . , DK each consist of n independent examples zi = (xi , yi ). [sent-59, score-0.299]

27 Each xi = (xi1 , . . . , xid ) is a d-dimensional centered, unit-covariance Gaussian variable (d = 30), and yi = √(3/d) Σ_{k=1}^{d} xik + εi , with the εi independent, centered, unit-variance Gaussian variables (the √(3/d) factor provides R² ≈ 3/4). [sent-62, score-0.401]

28 The learning algorithm consists in fitting a line by ordinary least squares, and the estimate of EPE is the average quadratic loss on test examples, EPE = L̄ = (1/K) Σ_{k=1}^{K} (1/n) Σ_{zi ∈ Tk} Lki , where Lki = L(A(Dk ), zi ). [sent-67, score-0.712]
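
The following sketch reproduces the spirit of this setup (illustrative code of ours, not the authors'; the function name, sample sizes and seed handling are assumptions): K independent training/test pairs, an OLS fit, and the matrix of per-example quadratic losses Lki.

```python
import numpy as np

def simulate_loss_matrix(K=10, n=100, d=30, rng=None):
    """Illustrative version of the setup above: K independent training/test
    pairs, an OLS fit, squared loss on each test example.
    Returns L of shape (K, n) with L[k, i] = L(A(D_k), z_i)."""
    rng = np.random.default_rng(rng)
    beta = np.sqrt(3.0 / d) * np.ones(d)       # true coefficients
    L = np.empty((K, n))
    for k in range(K):
        Xtr = rng.standard_normal((n, d))      # training set D_k
        ytr = Xtr @ beta + rng.standard_normal(n)
        Xte = rng.standard_normal((n, d))      # test set T_k
        yte = Xte @ beta + rng.standard_normal(n)
        w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)  # ordinary least squares
        L[k] = (yte - Xte @ w) ** 2            # quadratic loss per test example
    return L
```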

29 The first estimate of the variance of EPE is θ1 = (1/(Kn(Kn − 1))) Σ_{k=1}^{K} Σ_{i} (Lki − L̄)², which is unbiased provided there is no correlation between test errors. [sent-68, score-0.56]

30 The second estimate is θ2 = (1/(K(K − 1)n²)) Σ_{k=1}^{K} Σ_{i,j} (Lki − L̄)(Lkj − L̄), which takes the within-set correlations into account. [sent-69, score-0.116]
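
A minimal sketch of the two variance estimates as reconstructed above (illustrative code, assuming a loss matrix L of shape (K, n) such as the one produced by the simulate_loss_matrix sketch):

```python
import numpy as np

def theta1(L):
    """Naive estimate of the variance of the mean loss, ignoring correlations."""
    Kn = L.size
    Lbar = L.mean()
    return ((L - Lbar) ** 2).sum() / (Kn * (Kn - 1))

def theta2(L):
    """theta2 = 1/(K(K-1)n^2) * sum_k sum_{i,j} (L_ki - Lbar)(L_kj - Lbar);
    equivalently the usual variance of the mean of the K per-set averages."""
    K, n = L.shape
    Lbar = L.mean()
    s = L.sum(axis=1) - n * Lbar  # sum_i (L_ki - Lbar) for each set k
    return (s ** 2).sum() / (K * (K - 1) * n ** 2)
```

Comparing the average of theta1(L) or theta2(L) over many replications with the empirical variance of the mean loss reproduces the kind of ratio plotted in Figure 1.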

31 Note that Figure 1 suggests that the naive estimator of variance θ1 asymptotically converges to the true variance. [sent-70, score-0.499]

32 provided that the only source of variability of EPE is due to the finite test size. [sent-75, score-0.064]

33 2 K-fold cross-validation estimates of performance In K-fold cross-validation [9], the data set D is first chunked into K disjoint subsets (or blocks) of the same size m = n/K (to simplify the analysis below we assume that n is a multiple of K). [sent-77, score-0.058]

34 Let us write Tk for the k-th such block, and Dk the training set obtained by removing the elements in Tk from D. [sent-78, score-0.074]

35 The estimator is CV = (1/K) Σ_{k=1}^{K} (1/m) Σ_{zi ∈ Tk} L(A(Dk ), zi ). (3) [sent-79, score-0.452]

36 Under stability assumptions on A, CV estimates PE(D) at least as accurately as the training error [6]. [sent-80, score-0.405]
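
For reference, a minimal sketch of the estimator in Equation (3) (illustrative code; the fit/loss callables are assumptions, not the paper's API):

```python
import numpy as np

def kfold_cv(X, y, fit, loss, K):
    """K-fold cross-validation estimate CV of Equation (3).
    fit(Xtr, ytr) returns a predictor f; loss(f, Xte, yte) returns the
    per-example losses of f on a test block."""
    n = len(y)
    m = n // K                          # block size (assumes K divides n)
    idx = np.arange(n)
    block_means = []
    for k in range(K):
        test = idx[k * m:(k + 1) * m]   # T_k
        train = np.setdiff1d(idx, test) # D_k = D \ T_k
        f = fit(X[train], y[train])
        block_means.append(loss(f, X[test], y[test]).mean())
    return float(np.mean(block_means))
```

With an OLS learner, for instance, one could pass fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0] and loss = lambda w, X, y: (y - X @ w) ** 2.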

37 However, as CV is an average of unbiased estimates of PE(D1 ), . . . , PE(DK ), [sent-81, score-0.303]

38 a more general statement is that CV unbiasedly estimates EPE(n − m). [sent-84, score-0.085]

39 Note that the forthcoming analysis also applies to the version of cross-validation dedicated to comparing algorithms, using matched pairs ∆CV = (1/K) Σ_{k=1}^{K} (1/m) Σ_{zi ∈ Tk} [L(A1 (Dk ), zi ) − L(A2 (Dk ), zi )], and to the delete-m jackknife estimate of PE(D) debiasing the training error (see e.g. [5]). [sent-85, score-0.867]

40 The delete-m jackknife estimate is JK = (1/n) Σ_{i=1}^{n} L(A(D), zi ) − (K − 1) [ (1/(K(n − m))) Σ_{k=1}^{K} Σ_{zi ∈ Dk} L(A(Dk ), zi ) − (1/n) Σ_{i=1}^{n} L(A(D), zi ) ]. [sent-87, score-0.892]
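
A sketch of this estimate, following the delete-m jackknife formula exactly as reconstructed above (the formula's exact form is our reading of the garbled source; fit and loss are the same assumed callables as in the cross-validation sketch):

```python
import numpy as np

def delete_m_jackknife(X, y, fit, loss, K):
    """Delete-m jackknife estimate of PE(D), following the formula above:
    JK = train_err - (K - 1) * (avg_block_train_err - train_err), where
    train_err = (1/n) sum_i L(A(D), z_i) and avg_block_train_err =
    (1/(K(n-m))) sum_k sum_{z_i in D_k} L(A(D_k), z_i)."""
    n = len(y)
    m = n // K
    idx = np.arange(n)
    f_full = fit(X, y)
    train_err = loss(f_full, X, y).mean()
    inner = 0.0
    for k in range(K):
        test = idx[k * m:(k + 1) * m]
        train = np.setdiff1d(idx, test)   # D_k
        f_k = fit(X[train], y[train])
        inner += loss(f_k, X[train], y[train]).sum()
    inner /= K * (n - m)
    return train_err - (K - 1) * (inner - train_err)
```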

41 Note that µ̂ is the average of identically distributed (dependent) variables. [sent-89, score-0.118]

42 Thus, it asymptotically converges to a normally distributed variable, which is completely characterized by its expectation E[µ̂] and its variance Var[µ̂]. [sent-90, score-0.311]

43 3 Structure of the Covariance Matrix The variance of µ̂ is θ = (1/n²) Σ_{i,j} Cov(ei , ej ). [sent-91, score-0.364]

44 By using symmetry over permutations of the examples in D, we show that the covariance matrix has a simple block structure. [sent-92, score-0.268]

45 Corollary 1 The covariance matrix Σ of cross-validation errors e = (e1 , . . . , en ) has a simple block structure. [sent-96, score-0.159]

46 Figure 2: Structure of the covariance matrix. [sent-100, score-0.087]

47 the variance σ² is the average (taken over training sets) variance of errors for “true” test examples (i.e. [sent-103, score-0.705]

48 sampled independently from the training sets) when algorithm A is fed with training sets of size m(K − 1); [sent-105, score-0.21]

49 the within-block covariance ω would also apply to these “true” test examples; it arises from the dependence of test errors stemming from the common training set. [sent-106, score-0.361]

50 the between-blocks covariance γ is due to the dependence of training sets (which share n(K − 2)/K examples) and the fact that test block Tk appears in all the training sets Dℓ for ℓ ≠ k. [sent-108, score-0.459]
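
These three components can be assembled into the block covariance matrix of Figure 2 and plugged into the variance of µ̂. A small sketch (our own, with illustrative parameter values) that also checks the closed form θ = σ²/n + (m − 1)ω/n + (n − m)γ/n used in the proof of Theorem 1 below:

```python
import numpy as np

def block_covariance(n, K, sigma2, omega, gamma):
    """Covariance matrix with the block structure of Figure 2: sigma2 on the
    diagonal, omega within a test block, gamma across blocks."""
    m = n // K
    block = np.repeat(np.arange(K), m)   # block index of each example
    Sigma = np.where(block[:, None] == block[None, :], omega, gamma)
    np.fill_diagonal(Sigma, sigma2)
    return Sigma

def var_of_mean(n, K, sigma2, omega, gamma):
    """theta = Var[mu_hat] = sigma2/n + (m-1)*omega/n + (n-m)*gamma/n."""
    m = n // K
    return (sigma2 + (m - 1) * omega + (n - m) * gamma) / n

# sanity check: the closed form equals the quadratic form (1/n^2) * 1' Sigma 1
n, K = 12, 3
Sigma = block_covariance(n, K, sigma2=1.0, omega=0.3, gamma=0.1)
assert np.isclose(np.ones(n) @ Sigma @ np.ones(n) / n ** 2,
                  var_of_mean(n, K, 1.0, 0.3, 0.1))
```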

51 4 No Unbiased Estimator of Var[µ̂] Exists Consider a generic estimator θ̂ that depends on the sequence of cross-validation errors e = (e1 , e2 , . . . , en ). [sent-109, score-0.301]

52 Assuming θ̂ is analytic in e, consider its Taylor expansion: θ̂ = α0 + Σ_i α1 (i) ei + Σ_{i,j} α2 (i, j) ei ej + Σ_{i,j,k} α3 (i, j, k) ei ej ek + . . . (5) [sent-113, score-0.294]

53 We first show that for unbiased variance estimates (i.e. [sent-116, score-0.496]

54 Lemma 2 There is no universal unbiased estimator of Var[µ̂] that involves the ei in a non-quadratic way. [sent-119, score-0.477]

55 Since estimators that include moments other than the second moments in their expectation are biased, we now focus on estimators which are quadratic forms of the errors, i.e. [sent-121, score-0.477]

56 Theorem 1 There exists no universally unbiased estimator of Var[µ̂]. [sent-127, score-0.495]

57 Proof: thanks to Lemmas 2 and 3, it is enough to show that E[θ̂] = Var[µ̂] has no solution for quadratic estimators: E[θ̂] = Var[µ̂] ⇔ a(σ² + µ²) + b(ω + µ²) + c(γ + µ²) = (1/n)σ² + ((m − 1)/n)ω + ((n − m)/n)γ. [sent-128, score-0.079]

58 Matching the µ² terms then requires a + b + c = 0, which contradicts the coefficients imposed on σ², ω and γ, so the system has no solution. 5 Eigenanalysis of the covariance matrix One way to gain insight on the origin of the negative statement of Theorem 1 is via the eigenanalysis of Σ, the covariance of e. [sent-132, score-0.279]
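
The proof's contradiction can be restated numerically: requiring a = 1/n, b = (m − 1)/n, c = (n − m)/n together with a + b + c = 0 is an inconsistent linear system. A small sketch (arbitrary n and K, our own illustration):

```python
import numpy as np

# four equations in the three unknowns (a, b, c); the least-squares residual
# is strictly positive, i.e. the system has no exact solution
n, K = 100, 10
m = n // K
A = np.array([[1.0, 0.0, 0.0],    # sigma^2 coefficient: a = 1/n
              [0.0, 1.0, 0.0],    # omega coefficient:   b = (m-1)/n
              [0.0, 0.0, 1.0],    # gamma coefficient:   c = (n-m)/n
              [1.0, 1.0, 1.0]])   # mu^2 coefficient:    a + b + c = 0
rhs = np.array([1.0 / n, (m - 1.0) / n, (n - m) / n, 0.0])
_, residual, *_ = np.linalg.lstsq(A, rhs, rcond=None)
print(residual)   # strictly positive: no unbiased quadratic estimator exists
```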

59 This decomposition can be performed analytically thanks to the very particular block structure displayed in Figure 2. [sent-133, score-0.142]

60 Lemma 4 Let vk be the binary vector indicating the membership of each example to test block k. [sent-134, score-0.228]

61 The eigenvalues of Σ are as follows: • λ1 = σ² − ω with multiplicity n − K and eigenspace orthogonal to {vk}_{k=1}^{K} ; • λ2 = σ² + (m − 1)ω − mγ with multiplicity K − 1 and eigenspace defined in the orthogonal of 1 by the basis {vk}_{k=1}^{K} ; • λ3 = σ² + (m − 1)ω + (n − m)γ with eigenvector 1. [sent-135, score-0.315]
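
Lemma 4 is easy to check numerically on a small covariance matrix with the block structure of Figure 2 (illustrative values for σ², ω, γ; our own sketch):

```python
import numpy as np

# numerical check of the three eigenvalues and their multiplicities
n, K = 12, 3
m = n // K
sigma2, omega, gamma = 1.0, 0.3, 0.1
block = np.repeat(np.arange(K), m)
Sigma = np.where(block[:, None] == block[None, :], omega, gamma)
np.fill_diagonal(Sigma, sigma2)

lam1 = sigma2 - omega                              # multiplicity n - K
lam2 = sigma2 + (m - 1) * omega - m * gamma        # multiplicity K - 1
lam3 = sigma2 + (m - 1) * omega + (n - m) * gamma  # multiplicity 1
expected = np.sort(np.r_[np.full(n - K, lam1), np.full(K - 1, lam2), lam3])
assert np.allclose(np.sort(np.linalg.eigvalsh(Sigma)), expected)
```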

62 Lemma 4 states that the vector e can be decomposed into three uncorrelated parts: n − K projections to the subspace orthogonal to {vk}_{k=1}^{K} , K − 1 projections to the subspace spanned by {vk}_{k=1}^{K} in the orthogonal of 1, and one projection on 1. [sent-136, score-0.22]

63 A single vector example with n independent elements can be seen as n independent examples. [sent-137, score-0.078]

64 Similarly, the uncorrelated projections of e can be equivalently represented by respectively n − K, K − 1 and one uncorrelated one-dimensional examples. [sent-138, score-0.115]

65 In particular, for the projection on 1, with a single example, the sample variance is null, resulting in the absence of an unbiased variance estimator of λ3 . [sent-139, score-0.981]

66 Hence there is no unbiased estimate of Var[µ̂] = λ3 /n when we have only one realization of the vector e. [sent-141, score-0.324]

67 6 Possible values for ω and γ Theorem 1 states that no estimator is unbiased, and in its demonstration, it is shown that the bias of any quadratic estimator is a linear combination of µ², σ², ω and γ. [sent-146, score-0.555]

68 Hence, we cannot propose a variance estimate with universally small bias. [sent-150, score-0.32]

69 7 Experiments The bias of any quadratic estimator is a linear combination of µ², σ², ω and γ. [sent-151, score-0.326]

70 The admissible values provided earlier suggest that ω and γ cannot be proved to be negligible compared to σ². [sent-152, score-0.077]

71 This section illustrates that in practice, the contribution to the variance of µ̂ due to ω and γ (see Equation (4)) can be of the same order as the one due to σ². [sent-153, score-0.256]

72 This confirms that the estimators of θ should indeed take into account the correlations of the ei . [sent-154, score-0.481]

73 The input xi = (xi1 , . . . , xid ) is still 30-dimensional, but it is now a mixture of two centered Gaussians: let ti be a binary variable with P (ti = 1) = p = 0.95; [sent-160, score-0.15]

74 ti = 1 ⇒ xi ∼ N (0, I), ti = 0 ⇒ xi ∼ N (0, 100 I); yi = √(3/(d(p + 100(1 − p)))) Σ_{k=1}^{d} xik + εi ; ti = 1 ⇒ εi ∼ N (0, 1/(p + 100(1 − p))), ti = 0 ⇒ εi ∼ N (0, 100/(p + 100(1 − p))). [sent-161, score-0.329]
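
A sketch of this outlier setup (our own illustrative code; it assumes the √ scaling factor as reconstructed above):

```python
import numpy as np

def sample_outlier_mixture(n, d=30, p=0.95, rng=None):
    """Outlier setup sketched above: with probability 1 - p the input and the
    noise have their variance inflated by a factor 100; the common scaling
    keeps the signal-to-noise ratio comparable to the no-outlier experiment."""
    rng = np.random.default_rng(rng)
    t = rng.random(n) < p                      # t_i = 1 marks "clean" examples
    X = rng.standard_normal((n, d)) * np.where(t, 1.0, 10.0)[:, None]
    c = p + 100.0 * (1.0 - p)
    eps = rng.standard_normal(n) * np.sqrt(np.where(t, 1.0, 100.0) / c)
    y = np.sqrt(3.0 / (d * c)) * X.sum(axis=1) + eps
    return X, y
```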

75 We now look at the variance of K-fold cross-validation (K = 10), and decompose it into the three orthogonal components σ², ω and γ. [sent-162, score-0.265]

76 Figure 3: Contributions of (σ², ω, γ) to the total variance Var[CV] vs. n − m (panels: no outliers, outliers). [sent-169, score-0.439]

77 Without outliers, the contribution of γ is very important for small sample sizes. [sent-171, score-0.078]

78 For large sample sizes, the overall variance is considerably reduced and is mainly caused by σ² because the learning algorithm returns very similar answers for all training sets. [sent-172, score-0.33]

79 When there are outliers, the contribution of γ is of the same order as the one of σ² even when the ratio of examples to free parameters is large (here up to 20). [sent-173, score-0.076]

80 Thus, in difficult situations, where A(D) varies according to the realization of D, neglecting the effect of ω and γ can be expected to introduce a bias of the order of the true variance. [sent-174, score-0.122]

81 The decomposition of θ into σ², ω and γ (Equation (4)) does not imply that K should be set either to n or to 2 (according to the sign of ω − γ) in order to minimize the variance of µ̂. [sent-176, score-0.217]

82 Modifying K affects σ², ω and γ through the size and overlaps of the training sets D1 , . . . , DK . [sent-177, score-0.11]

83 For a fixed sample size, the variance of µ̂ and the contributions of σ², ω and γ vary smoothly with K (of course, the mean of µ̂ is also affected in the process). [sent-181, score-0.295]

84 Figure 4: Contributions of (σ², ω, γ) to the total variance Var[CV] vs. K (panels: no outliers, outliers). [sent-190, score-0.439]

85 First, when having K independent training and test sets (K = 1 is the realistic case), the structure of hold-out errors resembles the one of cross-validation errors, with γ = 0. [sent-193, score-0.285]

86 Knowing that allows one to build the unbiased estimate θ2 given in Section 2.1: [sent-194, score-0.058]

87 knowing that γ = 0 removes the third equation of system (10) in the proof of Theorem 1. [sent-195, score-0.059]
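
A quick Monte Carlo restatement of this point (reusing the simulate_loss_matrix and theta2 sketches above; the number of replications is arbitrary): with independent splits, γ = 0 and θ2 tracks the true variance of the mean loss.

```python
import numpy as np

# with K independent training/test splits (gamma = 0), the average of theta2
# should roughly match the empirical variance of the mean loss
reps = 200
means, t2 = [], []
for r in range(reps):
    L = simulate_loss_matrix(K=10, n=100, rng=r)
    means.append(L.mean())
    t2.append(theta2(L))
print(np.var(means, ddof=1), np.mean(t2))  # the two numbers should roughly agree
```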

88 It is a special case of K-fold cross-validation where the training blocks are mutually independent since they do not overlap. [sent-197, score-0.181]

89 The between-block correlation stems from the fact that the training block D1 is the test block T2 and vice-versa. [sent-199, score-0.314]

90 The structure of the covariance matrix is simplified, without diagonal blocks. [sent-201, score-0.087]

91 The estimation difficulties however remain: even in this particular case, there is no unbiased estimate of variance. [sent-202, score-0.026]

92 To summarize, it is known that K-fold cross-validation may suffer from high variability, which can be responsible for bad choices in model selection and erratic behavior in the estimated expected prediction error [2, 4, 8]. [sent-204, score-0.059]

93 This paper demonstrates that estimating the variance of K-fold cross-validation is difficult. [sent-205, score-0.248]

94 Not only is there no unbiased estimate of this variance, but we have no theoretical result showing that this bias should be negligible in the non-asymptotic regime. [sent-206, score-0.352]

95 The eigenanalysis of the covariance matrix of errors traces the problem back to the dependencies between test-block errors, which induce the absence of redundant pieces of information regarding the average test error. [sent-207, score-0.383]

96 It is clear that this absence of redundancy is bound to provide difficulties in the estimation of variance. [sent-211, score-0.061]

97 Our experiments show that the bias incurred by ignoring test error dependencies can be of the order of the variance itself, even for large sample sizes. [sent-212, score-0.493]

98 The next step of this study consists in building and comparing variance estimators dedicated to the very specific structure of the test-block error dependencies. [sent-214, score-0.435]

99 No unbiased estimator of the variance of K-fold cross-validation. [sent-218, score-0.667]

100 A study of cross-validation and bootstrap for accuracy estimation and model selection. [sent-248, score-0.075]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('epe', 0.366), ('tk', 0.313), ('ei', 0.28), ('cv', 0.269), ('pe', 0.241), ('estimator', 0.229), ('zi', 0.223), ('unbiased', 0.221), ('variance', 0.217), ('dk', 0.206), ('var', 0.155), ('estimators', 0.152), ('ej', 0.147), ('outliers', 0.111), ('lki', 0.104), ('block', 0.088), ('covariance', 0.087), ('cov', 0.082), ('lemma', 0.079), ('eigenanalysis', 0.078), ('vk', 0.076), ('training', 0.074), ('errors', 0.072), ('ti', 0.071), ('blocks', 0.068), ('test', 0.064), ('identically', 0.061), ('estimate', 0.058), ('estimates', 0.058), ('jk', 0.054), ('quadratic', 0.054), ('eigenspace', 0.052), ('nadeau', 0.052), ('xid', 0.052), ('ar', 0.052), ('bootstrap', 0.049), ('correlations', 0.049), ('orthogonal', 0.048), ('admissible', 0.047), ('universally', 0.045), ('xik', 0.045), ('realization', 0.045), ('multiplicity', 0.045), ('bengio', 0.044), ('uncorrelated', 0.043), ('bias', 0.043), ('moments', 0.041), ('dedicated', 0.041), ('corollary', 0.04), ('contribution', 0.039), ('sample', 0.039), ('independent', 0.039), ('culties', 0.038), ('expectation', 0.037), ('examples', 0.037), ('sets', 0.036), ('ignoring', 0.035), ('wij', 0.035), ('absence', 0.035), ('expected', 0.034), ('distributed', 0.033), ('kn', 0.033), ('permutations', 0.033), ('estimating', 0.031), ('proof', 0.031), ('instability', 0.03), ('negligible', 0.03), ('en', 0.03), ('displayed', 0.029), ('naive', 0.029), ('theorem', 0.029), ('loss', 0.029), ('projections', 0.029), ('cance', 0.028), ('dif', 0.028), ('knowing', 0.028), ('assessment', 0.028), ('centered', 0.027), ('universal', 0.027), ('statement', 0.027), ('sampled', 0.026), ('splits', 0.026), ('estimation', 0.026), ('belonging', 0.025), ('eigenvector', 0.025), ('error', 0.025), ('stability', 0.025), ('thanks', 0.025), ('average', 0.024), ('jointly', 0.024), ('contributions', 0.024), ('converges', 0.024), ('ideal', 0.023), ('symmetry', 0.023), ('dependencies', 0.023), ('projection', 0.023), ('departs', 0.023), ('fourteenth', 0.023), ('montreal', 0.023), ('inappropriate', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

Author: Yoshua Bengio, Yves Grandvalet

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1

2 0.24903053 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) nonGaussian additive noise, and to unbiased non-additive models. 1

3 0.12516627 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

Author: Harald Steck, Tommi S. Jaakkola

Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1

4 0.092578717 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

Author: Dörthe Malzahn, Manfred Opper

Abstract: We compute approximate analytical bootstrap averages for support vector classification using a combination of the replica method of statistical physics and the TAP approach for approximate inference. We test our method on a few datasets and compare it with exact averages obtained by extensive Monte-Carlo sampling. 1

5 0.081023924 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

6 0.07614173 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

7 0.074965701 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

8 0.067793511 51 nips-2003-Design of Experiments via Information Theory

9 0.061206222 121 nips-2003-Log-Linear Models for Label Ranking

10 0.05634379 55 nips-2003-Distributed Optimization in Adaptive Networks

11 0.055369254 102 nips-2003-Large Scale Online Learning

12 0.051199365 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

13 0.050832514 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression

14 0.050743796 111 nips-2003-Learning the k in k-means

15 0.050251078 81 nips-2003-Geometric Analysis of Constrained Curves

16 0.049121808 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression

17 0.047806159 88 nips-2003-Image Reconstruction by Linear Programming

18 0.047412749 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

19 0.047395892 107 nips-2003-Learning Spectral Clustering

20 0.047074202 151 nips-2003-PAC-Bayesian Generic Chaining


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.175), (1, -0.051), (2, -0.017), (3, -0.005), (4, 0.068), (5, 0.093), (6, 0.056), (7, -0.089), (8, -0.042), (9, -0.009), (10, -0.064), (11, 0.082), (12, -0.018), (13, 0.031), (14, -0.016), (15, 0.007), (16, 0.079), (17, -0.285), (18, 0.156), (19, 0.112), (20, 0.069), (21, 0.104), (22, -0.013), (23, -0.063), (24, -0.043), (25, 0.033), (26, 0.041), (27, 0.01), (28, 0.051), (29, -0.013), (30, 0.045), (31, -0.059), (32, -0.133), (33, 0.038), (34, -0.152), (35, -0.104), (36, -0.047), (37, 0.063), (38, 0.19), (39, 0.107), (40, -0.015), (41, 0.039), (42, -0.02), (43, 0.181), (44, 0.11), (45, -0.039), (46, 0.071), (47, -0.155), (48, 0.134), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9639101 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

Author: Yoshua Bengio, Yves Grandvalet

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1

2 0.83169717 115 nips-2003-Linear Dependent Dimensionality Reduction

Author: Nathan Srebro, Tommi S. Jaakkola

Abstract: We formulate linear dimensionality reduction as a semi-parametric estimation problem, enabling us to study its asymptotic behavior. We generalize the problem beyond additive Gaussian noise to (unknown) nonGaussian additive noise, and to unbiased non-additive models. 1

3 0.47876272 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations.

4 0.41428545 100 nips-2003-Laplace Propagation

Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan

Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1

5 0.39966458 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

Author: Harald Steck, Tommi S. Jaakkola

Abstract: The bootstrap has become a popular method for exploring model (structure) uncertainty. Our experiments with artificial and realworld data demonstrate that the graphs learned from bootstrap samples can be severely biased towards too complex graphical models. Accounting for this bias is hence essential, e.g., when exploring model uncertainty. We find that this bias is intimately tied to (well-known) spurious dependences induced by the bootstrap. The leading-order bias-correction equals one half of Akaike’s penalty for model complexity. We demonstrate the effect of this simple bias-correction in our experiments. We also relate this bias to the bias of the plug-in estimator for entropy, as well as to the difference between the expected test and training errors of a graphical model, which asymptotically equals Akaike’s penalty (rather than one half). 1

6 0.3498809 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

7 0.32197469 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

8 0.31120783 66 nips-2003-Extreme Components Analysis

9 0.30999711 51 nips-2003-Design of Experiments via Information Theory

10 0.30165306 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

11 0.2995128 3 nips-2003-AUC Optimization vs. Error Rate Minimization

12 0.29278177 102 nips-2003-Large Scale Online Learning

13 0.290867 55 nips-2003-Distributed Optimization in Adaptive Networks

14 0.26125363 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

15 0.25377837 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

16 0.24931005 111 nips-2003-Learning the k in k-means

17 0.24634992 128 nips-2003-Minimax Embeddings

18 0.24283178 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

19 0.24087861 48 nips-2003-Convex Methods for Transduction

20 0.24057768 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (11, 0.034), (29, 0.025), (30, 0.021), (35, 0.047), (39, 0.291), (49, 0.013), (53, 0.151), (69, 0.014), (71, 0.072), (76, 0.036), (78, 0.015), (85, 0.062), (91, 0.081), (99, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81302434 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

Author: Yoshua Bengio, Yves Grandvalet

Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments. 1

2 0.57080287 179 nips-2003-Sparse Representation and Its Applications in Blind Source Separation

Author: Yuanqing Li, Shun-ichi Amari, Sergei Shishkin, Jianting Cao, Fanji Gu, Andrzej S. Cichocki

Abstract: In this paper, sparse representation (factorization) of a data matrix is first discussed. An overcomplete basis matrix is estimated by using the K−means method. We have proved that for the estimated overcomplete basis matrix, the sparse solution (coefficient matrix) with minimum l1 −norm is unique with probability of one, which can be obtained using a linear programming algorithm. The comparisons of the l1 −norm solution and the l0 −norm solution are also presented, which can be used in recoverability analysis of blind source separation (BSS). Next, we apply the sparse matrix factorization approach to BSS in the overcomplete case. Generally, if the sources are not sufficiently sparse, we perform blind separation in the time-frequency domain after preprocessing the observed data using the wavelet packets transformation. Third, an EEG experimental data analysis example is presented to illustrate the usefulness of the proposed approach and demonstrate its performance. Two almost independent components obtained by the sparse representation method are selected for phase synchronization analysis, and their periods of significant phase synchronization are found which are related to tasks. Finally, concluding remarks review the approach and state areas that require further study. 1

3 0.57006395 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

Author: Thomas Natschläger, Wolfgang Maass

Abstract: We employ an efficient method using Bayesian and linear classifiers for analyzing the dynamics of information in high-dimensional states of generic cortical microcircuit models. It is shown that such recurrent circuits of spiking neurons have an inherent capability to carry out rapid computations on complex spike patterns, merging information contained in the order of spike arrival with previously acquired context information. 1

4 0.5692957 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1

5 0.5643779 66 nips-2003-Extreme Components Analysis

Author: Max Welling, Christopher Williams, Felix V. Agakov

Abstract: Principal components analysis (PCA) is one of the most widely used techniques in machine learning and data mining. Minor components analysis (MCA) is less well known, but can also play an important role in the presence of constraints on the data distribution. In this paper we present a probabilistic model for “extreme components analysis” (XCA) which at the maximum likelihood solution extracts an optimal combination of principal and minor components. For a given number of components, the log-likelihood of the XCA model is guaranteed to be larger or equal than that of the probabilistic models for PCA and MCA. We describe an efficient algorithm to solve for the globally optimal solution. For log-convex spectra we prove that the solution consists of principal components only, while for log-concave spectra the solution consists of minor components. In general, the solution admits a combination of both. In experiments we explore the properties of XCA on some synthetic and real-world datasets.

6 0.56398785 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

7 0.56354874 30 nips-2003-Approximability of Probability Distributions

8 0.56305611 107 nips-2003-Learning Spectral Clustering

9 0.56062329 120 nips-2003-Locality Preserving Projections

10 0.56020921 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

11 0.55994588 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

12 0.55793738 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

13 0.55726433 40 nips-2003-Bias-Corrected Bootstrap and Model Uncertainty

14 0.55696458 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing

15 0.5569002 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

16 0.55642116 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

17 0.55639523 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

18 0.55594265 112 nips-2003-Learning to Find Pre-Images

19 0.55506068 115 nips-2003-Linear Dependent Dimensionality Reduction

20 0.55486888 113 nips-2003-Learning with Local and Global Consistency