nips nips2001 nips2001-76 knowledge-graph by maker-knowledge-mining

76 nips-2001-Fast Parameter Estimation Using Green's Functions


Source: pdf

Author: K. Wong, F. Li

Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. [sent-6, score-0.993]

2 1 Introduction It is well known that correct choices of hyperparameters in classification and regression tasks can optimize the complexity of the data model, and hence achieve the best generalization [1]. [sent-8, score-0.22]

3 In recent years various methods have been proposed to estimate the optimal hyperparameters in different contexts, such as neural networks [2], support vector machines [3, 4, 5] and Gaussian processes [5]. [sent-9, score-0.251]

4 While the leave-one-out procedure gives an almost unbiased estimate of the generalization error, it is nevertheless very tedious. [sent-11, score-0.119]

5 In this paper, we propose a new approach to hyperparameter estimation in large systems. [sent-14, score-0.21]

6 It is known that large networks are mean-field systems, so that when one example is removed by the leave-one-out procedure, the background adjustment can be analyzed by a self-consistent perturbation approach. [sent-15, score-0.045]

7 They usually involve a macroscopic number of unknown variables, whose solution is obtained through the inversion of a matrix of macroscopic size, or iteration. [sent-17, score-0.186]

8 Here we take a further step to replace it by a direct measurement of the Green's function via a small number of learning processes. [sent-18, score-0.118]

9 The proposed technique is based on the cavity method, which was adapted from disordered systems in many-body physics. [sent-21, score-0.599]

10 The basis of the cavity method is a self-consistent argument addressing the situation of removing an example from the system. [sent-22, score-0.658]

11 The change on removing an example is described by the Green's function, which is an extremely general technique used in a wide range of quantum and classical problems in many-body physics [8]. [sent-23, score-0.067]

12 In this paper, we consider two applications of the cavity method to hyperparameter estimation, namely the optimal weight decay and the optimal learning time in feedforward networks. [sent-25, score-1.085]

13 In the latter application, the cavity method provides, as far as we are aware, the only estimate of the hyperparameter beyond empirical stopping criteria and brute-force cross-validation. [sent-26, score-0.98]

14 An energy function E is defined with respect to a set of p examples with inputs and outputs respectively given by $\{\vec{\xi}^\mu\}$ and $\{y^\mu\}$, $\mu = 1, \ldots, p$. [sent-28, score-0.048]

15 We will first focus on the dynamics of a single-layer feedforward network and generalize the results to multilayer networks later. [sent-32, score-0.195]

16 Here $\epsilon(x^\mu, y^\mu)$ in equation (1) represents the error function with respect to example $\mu$. [sent-34, score-0.047]

17 It is expressed in terms of the activation $x^\mu \equiv \vec{w} \cdot \vec{\xi}^\mu$. [sent-35, score-0.064]

18 R( w) represents a regularization term which is introduced to limit the complexity of the network and hence enhance the generalization ability. [sent-36, score-0.125]

19 Hence if we compare the evolution of $w_j(t)$ with another system $\tilde{w}_j(t)$ subject to a continuous perturbative stimulus $\delta h_j(t)$, we would have $\frac{d\tilde{w}_j(t)}{dt} = -\frac{1}{N}\frac{\partial E}{\partial \tilde{w}_j} + \delta h_j(t)$. (3) [sent-38, score-0.244]

20 The comparison yields the linear response relation $\tilde{w}_j(t) = w_j(t) + \sum_k \int ds\, G_{jk}(t, s)\, \delta h_k(s)$. (4) Now we consider the evolution of the network $w_j^{\setminus\mu}(t)$ in which example $\mu$ is omitted from the training set. [sent-40, score-0.211]

21 For a system learning a macroscopic number of examples, the changes induced by the omission of an example are perturbative, and we can assume that the system has a linear response. [sent-41, score-0.075]

22 Compared with the original network Wj(t), the gradient of the error of example JL now plays the role of the stimulus in (3). [sent-42, score-0.224]

23 When the dynamics has reached the steady state, we arrive at $h^\mu = x^\mu + \gamma\, \frac{\partial \epsilon(x^\mu, y^\mu)}{\partial x^\mu}$ (7), where $\gamma = \lim_{t\to\infty} \int ds\, \big[\sum_{jk} \xi_j^\mu\, G_{jk}(t, s)\, \xi_k^\mu\big]/N$ is the susceptibility. [sent-45, score-0.18]

24 At time $t$, the generalization error is defined as the error function averaged over the distribution of inputs $\vec{\xi}$ and their corresponding outputs $y$. [sent-46, score-0.179]

25 The leave-one-out generalization error is an estimate of $\epsilon_g$, given in terms of the cavity activations $h^\mu$ by $\epsilon_g = \sum_\mu \epsilon(h^\mu, y^\mu)/p$. [sent-50, score-0.926]

26 Hence if we can estimate the Green's function, the cavity activation in (7) provides a convenient way to estimate the leave-one-out generalization error without really having to undergo the validation process. [sent-51, score-0.931]
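
To make this concrete, here is a minimal Python sketch of how the cavity estimate in (7) might be turned into a leave-one-out error estimate once the susceptibility has been measured. The function name, array layout and the quadratic error in the trailing comment are illustrative assumptions, not the authors' code.

```python
import numpy as np

def loo_estimate_from_cavity(X, y, w, gamma, err, derr):
    """Leave-one-out estimate eps_g = sum_mu eps(h^mu, y^mu) / p from the
    cavity activations of eq. (7).

    X     : (p, N) array of (whitened) inputs xi^mu
    y     : (p,) array of outputs y^mu
    w     : (N,) weight vector of the trained network
    gamma : measured susceptibility (scalar)
    err   : error function eps(x, y); derr : its derivative w.r.t. x
    """
    x = X @ w                      # generic activations x^mu = w . xi^mu
    h = x + gamma * derr(x, y)     # cavity activations h^mu, eq. (7)
    return np.mean(err(h, y))      # estimated leave-one-out generalization error

# Illustrative quadratic error:
# err  = lambda x, y: 0.5 * (x - y) ** 2
# derr = lambda x, y: x - y
```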

27 While self-consistent equations for the Green's function have been derived using diagrammatic methods [9], their solutions cannot be computed except for the specific case of time-translational invariant Green's functions, such as those in Adaline learning or linear regression. [sent-52, score-0.051]

28 However, the linear response relation (4) provides a convenient way to measure the Green's function in the general case. [sent-53, score-0.106]

29 The basic idea is to perform two learning processes in parallel, one following the original process (2) and the other having a constant stimulus as in (3) with $\delta h_j(t) = \eta\, \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta. [sent-54, score-0.106]

30 When the dynamics has reached the steady state, the measurement $\tilde{w}_j - w_j$ yields the quantity $\eta \int ds\, G_{jk}(t, s)$. [sent-55, score-0.369]

31 A simple averaging procedure, replacing all the pairwise measurements between the stimulation node k and observation node j, can be applied in the limit of large N. [sent-56, score-0.136]

32 We first consider the case in which the inputs are independent and normalized. [sent-57, score-0.048]

33 In this case the Green's function is diagonal, $G_{jk}(t, s) = G(t, s)\, \delta_{jk}$, independent of the node labels [9], rendering $\gamma = \lim_{t\to\infty} \int ds\, G(t, s)$. [sent-62, score-0.068]

34 In the case that the inputs are correlated and not normalized, we can apply standard procedures of whitening transformation to make them independent and normalized [1]. [sent-63, score-0.126]

35 In large networks, one can use the diagrammatic analysis in [9] to show that the (unknown) distribution of inputs does not change the self-averaging property of the Green's functions after the whitening transformation. [sent-64, score-0.137]

36 Thereafter, the measurement of Green's functions proceeds as described in the simpler case of independent and normalized inputs. [sent-65, score-0.158]

37 Since hyperparameter estimation usually involves computing $\epsilon_g$ repeatedly at various hyperparameters, the one-time preprocessing does not increase the computational load significantly. [sent-66, score-0.255]
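
As a rough illustration of this one-time preprocessing, a standard whitening step could be sketched as follows; the ZCA-style construction and the small eigenvalue regularizer eps are assumptions, since the text only states that a standard whitening transformation is applied.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Transform raw inputs so that they are approximately independent and
    normalized, <xi_j> = 0 and <xi_j xi_k> = delta_jk.

    X : (p, N) array of raw inputs.  Returns the whitened inputs together
    with the (mean, transform) pair so that new examples can be mapped
    consistently.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T  # ZCA-style whitening
    return Xc @ W, (mu, W)
```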

38 Thus the susceptibility $\gamma$ can be measured by comparing the evolution of two processes: one following the original process (2), and the other having a constant stimulus as in (3) with $\delta h_j(t) = \eta$ for all $j$. [sent-67, score-0.133]

39 When the dynamics has reached the steady state, the measurement $\tilde{w}_j - w_j$ yields the quantity $\eta\gamma$. [sent-68, score-0.369]
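
The paired-dynamics measurement of $\gamma$ described in the last two sentences might be sketched as below. Here grad_E(w) is assumed to return $(1/N)\,\partial E/\partial \vec{w}$ for the chosen error function and regularizer, and the stimulus size, learning rate and number of steps are placeholders.

```python
import numpy as np

def measure_susceptibility(grad_E, N, eta=1e-3, lr=0.05, steps=5000, seed=0):
    """Measure gamma by running the original process (2) alongside a copy
    driven by a constant stimulus eta on every component, as in (3)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / np.sqrt(N), size=N)
    w_tilde = w.copy()
    for _ in range(steps):
        w -= lr * grad_E(w)                        # original process, eq. (2)
        w_tilde -= lr * (grad_E(w_tilde) - eta)    # stimulated process, eq. (3)
    # At the steady state, w_tilde_j - w_j = eta * gamma for every j;
    # averaging over j gives a self-averaging estimate of gamma.
    return np.mean(w_tilde - w) / eta
```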

40 We illustrate the extension to two-layer networks by considering the committee machine, in which the error function takes the form $\epsilon(\sum_a f(x_a), y)$. [sent-69, score-0.085]

41 Here $a = 1, \ldots, n_h$ is the label of a hidden node, and $x_a \equiv \vec{w}_a \cdot \vec{\xi}$ is its activation. [sent-70, score-0.158]

42 The generalization error is thus a function of the cavity activations of the hidden nodes, namely $\epsilon_g = \sum_\mu \epsilon(\sum_a f(h_a^\mu), y^\mu)/p$. [sent-73, score-0.855]

43 When the inputs are independent and normalized, the cavity activations are related to the generic activations by $h_a^\mu = x_a^\mu + \sum_b \gamma_{ab}\, \frac{\partial \epsilon(\sum_c f(x_c^\mu), y^\mu)}{\partial x_b^\mu}$, (9) [sent-76, score-0.125]

44 where $\gamma_{ab} = \lim_{t\to\infty} \int ds\, G_{ab}(t, s)$ is the susceptibility tensor. [sent-77, score-0.111]

45 The Green's function Gab(t, s) represents the response of a weight feeding hidden node a due to a stimulus applied at the gradient with respect to a weight feeding node b. [sent-78, score-0.583]

46 It is obtained by monitoring $n_h + 1$ learning processes, one being the original and each of the other $n_h$ processes having a constant stimulus at the gradients with respect to the weights feeding one of the hidden nodes, $a = 1, \ldots, n_h$. [sent-79, score-0.304]

47 When the dynamics has reached the steady state, the measurement $\tilde{w}_{aj} - w_{aj}$ yields the quantity $\eta\gamma_{ab}$. [sent-83, score-0.369]
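
For the committee machine the same measurement uses $n_h + 1$ processes; a sketch under the same assumptions (grad_E(W) returning a fresh $(n_h, N)$ array with $(1/N)\,\partial E/\partial W$, and illustrative hyperparameters) is:

```python
import numpy as np

def measure_susceptibility_tensor(grad_E, n_h, N, eta=1e-3, lr=0.05, steps=5000, seed=0):
    """One original process plus n_h stimulated copies; copy b receives a
    constant stimulus eta on the gradients of the weights feeding hidden
    node b.  Returns the estimated susceptibility tensor gamma_ab."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(N), size=(n_h, N))
    copies = [W.copy() for _ in range(n_h)]
    for _ in range(steps):
        W -= lr * grad_E(W)                        # original process
        for b, Wb in enumerate(copies):
            g = grad_E(Wb)
            g[b] -= eta                            # stimulus on hidden node b
            Wb -= lr * g
    # At the steady state, the component-averaged difference between copy b
    # and the original at hidden node a equals eta * gamma_ab.
    return np.array([[np.mean(copies[b][a] - W[a]) / eta for b in range(n_h)]
                     for a in range(n_h)])
```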

48 Consider the case that the regularization $R(\vec{w})$ takes the form of a weight decay term, $R(\vec{w}) = N \sum_{ab} \lambda_{ab}\, \vec{w}_a \cdot \vec{w}_b$. [sent-85, score-0.191]

49 The cavity activations will then be given by an expression of the same form as (9), with the susceptibility tensor written explicitly in terms of the input vectors $\xi_j^\mu$, $\xi_k^\mu$ and the matrix $\Lambda + Q$, multiplying the derivatives $\partial \epsilon(\sum_c f(x_c^\mu), y^\mu)/\partial x_b^\mu$. [sent-87, score-0.676]

50 Note also that second derivatives of the error term have been neglected. [sent-90, score-0.047]

51 To verify the proposed method by simulations, we generate examples from a noisy teacher network which is a committee machine with erf hidden units and additive output noise of strength $\sigma$ (equation (13)); here $\vec{B}_a$ is the teacher vector at hidden node $a$. [sent-91, score-0.791]

52 Learning is done by gradient descent on the energy function (14), and the weight decay parameter $\lambda$ is the hyperparameter to be optimized. [sent-94, score-0.454]
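
For orientation, a bare-bones version of such a training loop, written here for a single-layer student with quadratic error and a scalar weight decay $\lambda$ (an illustrative stand-in for the energy function (14); the multilayer case adds the hidden-unit structure), might read:

```python
import numpy as np

def train_student(X, y, lam, lr=0.05, steps=5000, seed=0):
    """Gradient descent on E = sum_mu eps(x^mu, y^mu) + (N*lam/2)|w|^2 with a
    quadratic error eps(x, y) = (x - y)^2 / 2 (illustrative choice)."""
    p, N = X.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / np.sqrt(N), size=N)
    for _ in range(steps):
        x = X @ w
        grad = X.T @ (x - y) / N + lam * w    # (1/N) dE/dw including weight decay
        w -= lr * grad
    return w
```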

53 The generalization error $\epsilon_g$ is obtained by averaging the error function over the distribution of inputs $\vec{\xi}$ and the noise $z$. [sent-95, score-0.177]

54 However, this target result is only known by the teacher, since $T_{ab}$ and $R_{ab}$ are not accessible to the student. [sent-100, score-0.213]

55 Four results are compared: the target generalization error observed by the teacher, and those estimated by the cavity method, cross-validation and extended LULOO. [sent-102, score-0.763]

56 It can be seen that the cavity method yields estimates of the optimal weight decay with precision comparable to that of the other methods. [sent-103, score-0.96]

57 For a more systematic comparison, we search for the optimal weight decay in 10 samples using golden section search [11] for the same parameters as in Fig. [sent-104, score-0.374]
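
A golden section search over the weight decay is a standard routine; a minimal sketch, with the bracketing interval and tolerance left as placeholders, is given below. Here f(lam) would return the estimated generalization error $\epsilon_g$ for weight decay lam, e.g. via the cavity estimate.

```python
import math

def golden_section_search(f, a, b, tol=1e-3):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    invphi = (math.sqrt(5) - 1) / 2                 # 1/phi ~ 0.618
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while abs(b - a) > tol:
        if fc < fd:                                 # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                       # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)
```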

58 Compared with the target results, the standard deviations of the estimated optimal weight decays are 0. [sent-106, score-0.164]

59 24 for the cavity method, sevenfold cross-validation and extended LULOO respectively. [sent-109, score-0.682]

60 In another simulation of 80 samples of the singlelayer perceptron, the estimated optimal weight decays have standard deviations of 1. [sent-110, score-0.205]

61 6 for the cavity method, tenfold cross-validation and extended LULOO respectively (the parameters in the simulations are N = 500, p = 400 and a ranging from 0. [sent-113, score-0.682]

62 To put these results in perspective, we mention that the computational resources needed by the cavity method are much less than those of the other estimation methods. [sent-116, score-0.694]

63 For example, in the single-layer perceptrons, the CPU time needed to estimate the optimal weight decay using the golden section search by the teacher, the cavity method, tenfold cross-validation and extended LULOO are in the ratio of 1 : 1. [sent-117, score-1.055]

64 Before concluding this section, we mention that it is possible to derive an expression for the gradient $d\epsilon_g/d\lambda$ of the estimated generalization error with respect to the weight decay. [sent-121, score-0.309]

65 This provides us with an even more powerful tool for hyperparameter estimation. [sent-122, score-0.164]

66 In the case of a search for a single hyperparameter, the gradient enables us to use a binary search for the zero of the gradient, which converges faster than the golden section search. [sent-123, score-0.213]
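
With the gradient $d\epsilon_g/d\lambda$ available, the one-dimensional search reduces to locating its zero; a bisection sketch, assuming the bracketing interval [lo, hi] straddles the sign change, is:

```python
def bisect_zero_gradient(dEg_dlam, lo, hi, tol=1e-4, max_iter=60):
    """Binary search for the weight decay at which d eps_g / d lam changes
    sign, i.e. the estimated minimum of eps_g.
    Assumes dEg_dlam(lo) < 0 < dEg_dlam(hi)."""
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if dEg_dlam(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```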

67 In the single-layer experiment we mentioned, its precision is comparable to that of fivefold cross-validation, and its CPU time is only 4% more than the teacher's search. [sent-124, score-0.113]

68 In the case of more than one hyperparameter, the gradient information spares us the need for an exhaustive search over a multidimensional hyperparameter space. [sent-126, score-0.265]

69 3 Dynamical Hyperparameter Estimation The second example concerns the estimation of a dynamical hyperparameter, namely the optimal early stopping time, in cases where overtraining may plague the generalization ability at the steady state. [sent-127, score-0.52]

70 In perceptrons, when the examples are noisy and the weight decay is weak, the generalization error decreases in the early stage of learning, reaches a minimum and then increases towards its asymptotic value [12, 9]. [sent-128, score-0.38]

71 Since the early stopping point sets in before the system reaches the steady state, most analyses based on the equilibrium state are not applicable. [sent-129, score-0.274]

72 Cross-validation stopping has been proposed as an empirical method to control overtraining [13]. [sent-130, score-0.227]

73 Here we propose the cavity method as a convenient alternative. [sent-131, score-0.694]

74 [Figure 1 legend: target, cavity and LULOO curves.] [sent-133, score-0.599]

75 Figure 1: (a-d) The dependence of the generalization error of the multilayer perceptron on the weight decay $\lambda$ for N = 200, p = 700, $n_h = 3$, $\sigma$ = 0. [sent-142, score-0.92]

76 The solid symbols locate the optimal weight decays estimated by the teacher (circle), the cavity method (square), extended LULOO (diamond) and sevenfold cross-validation (triangle). [sent-144, score-1.169]

77 In single-layer perceptrons, the cavity activations of the examples evolve according to (6), enabling us to estimate the dynamical evolution of the estimated generalization error as learning proceeds. [sent-145, score-0.939]

78 The remaining issue is the measurement of the time-dependent Green's function. [sent-146, score-0.118]

79 Again, assuming normalized and independent inputs with $\langle\xi_j\rangle = 0$ and $\langle\xi_j \xi_k\rangle = \delta_{jk}$, we can see from (4) that the measurement $\tilde{w}_j(t) - w_j(t)$ yields the quantity $\eta\, G(t, 0)$. [sent-148, score-0.277]

80 Such are the cases for Adaline learning and linear regression [9], where the cavity activation can be written as $h^\mu(t) = x^\mu(t) + \int ds\, G(t - s, 0)\, \frac{\partial \epsilon(x^\mu(s), y^\mu)}{\partial x^\mu(s)}$. (16) [sent-152, score-0.697]

81 This allows us to estimate the generalization error $\epsilon_g(t)$ via $\epsilon_g(t) \approx \sum_\mu \epsilon(h^\mu(t), y^\mu)/p$, [sent-153, score-0.166]

82 whose minimum in time determines the early stopping point. [sent-154, score-0.181]

83 To verify the proposed method in linear regression, we randomly generate examples from a noisy teacher with $y^\mu = \vec{B} \cdot \vec{\xi}^\mu + \sigma z^\mu$. [sent-155, score-0.272]

84 Here $\vec{B}$ is the teacher vector, with $B^2 = 1$. [sent-156, score-0.213]

85 Learning is done by gradient descent on the energy function E(t). [sent-158, score-0.099]

86 The generalization error $\epsilon_g(t)$ is the error averaged over the distribution of inputs $\vec{\xi}$ and their corresponding outputs $y$. [sent-161, score-0.179]

87 As far as the teacher is concerned, $\epsilon_g(t)$ can be computed as $\epsilon_g(t) = (1 - 2R(t) + Q(t) + \sigma^2)/2$. [sent-165, score-0.213]

88 Three results are compared: the teacher's estimate, the cavity method and cross-validation. [sent-169, score-0.658]

89 Again, we see that the cavity method yields estimates of the early stopping time with precision comparable to that of cross-validation. [sent-171, score-0.914]

90 The ratio of the CPU time between the cavity method and fivefold cross-validation is 1 : 1. [sent-172, score-0.735]

91 For nonlinear regression and multilayer networks, the Green's functions are not time-translational invariant. [sent-174, score-0.099]

92 Preliminary results for the determination of the early stopping point are satisfactory and final results will be presented elsewhere. [sent-176, score-0.181]

93 Figure 2: (a-f) The evolution of the generalization error of linear regression for N = 500, p = 600 and $\sigma$ = 1. [sent-191, score-0.228]

94 The solid symbols locate the early stopping points estimated by the teacher (circle), the cavity method (square) and fivefold cross-validation (diamond). [sent-192, score-1.18]

95 4 Conclusion We have proposed a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, combined with an empirical method of measuring the Green's function. [sent-193, score-0.934]

96 It does not require the validation process or the inversion of matrices of macroscopic size, and hence its speed compares favorably with cross-validation and other perturbative approaches such as extended LULOO. [sent-195, score-0.252]

97 For multilayer networks, we will explore further speedup of the Green's function measurement by multiplexing the stimuli to the different hidden units into a single network, to be compared with a reference network. [sent-196, score-0.23]

98 Our initial success indicates that it is possible to generalize the method to more complicated systems in the future. [sent-198, score-0.059]

99 The concept of Green's functions is very general, and its measurement by comparing the states of a stimulated system with a reference one can be applied to general cases with suitable adaptation. [sent-199, score-0.118]

100 It would be interesting to consider how the proposed method can contribute to these cases. [sent-201, score-0.059]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cavity', 0.599), ('green', 0.263), ('teacher', 0.213), ('wj', 0.213), ('luloo', 0.204), ('hyperparameter', 0.164), ('jl', 0.144), ('stopping', 0.124), ('measurement', 0.118), ('decay', 0.116), ('nh', 0.111), ('yjl', 0.102), ('hyperparameters', 0.101), ('steady', 0.093), ('oe', 0.089), ('eg', 0.086), ('generalization', 0.085), ('activations', 0.077), ('fivefold', 0.077), ('golden', 0.077), ('perturbative', 0.077), ('susceptibility', 0.077), ('macroscopic', 0.075), ('weight', 0.075), ('stimulus', 0.071), ('node', 0.068), ('gradient', 0.066), ('multilayer', 0.065), ('activation', 0.064), ('evolution', 0.062), ('ae', 0.061), ('hong', 0.061), ('method', 0.059), ('early', 0.057), ('xa', 0.057), ('decays', 0.053), ('adaline', 0.051), ('diagrammatic', 0.051), ('dsg', 0.051), ('dsgjk', 0.051), ('jhj', 0.051), ('jzw', 0.051), ('locate', 0.051), ('rab', 0.051), ('sevenfold', 0.051), ('tab', 0.051), ('tenfold', 0.051), ('xian', 0.051), ('cpu', 0.051), ('inputs', 0.048), ('hidden', 0.047), ('error', 0.047), ('perceptrons', 0.046), ('ij', 0.046), ('estimation', 0.046), ('networks', 0.045), ('dynamics', 0.045), ('fg', 0.045), ('overtraining', 0.044), ('gjk', 0.044), ('dwj', 0.044), ('winther', 0.044), ('lab', 0.042), ('reached', 0.042), ('simulation', 0.041), ('muller', 0.041), ('committee', 0.04), ('wb', 0.04), ('diamond', 0.04), ('kong', 0.04), ('network', 0.04), ('normalized', 0.04), ('ofthe', 0.039), ('yields', 0.039), ('feeding', 0.038), ('whitening', 0.038), ('wong', 0.038), ('response', 0.037), ('precision', 0.036), ('convenient', 0.036), ('il', 0.036), ('optimal', 0.036), ('mention', 0.036), ('quantum', 0.036), ('xb', 0.036), ('inversion', 0.036), ('search', 0.035), ('dynamical', 0.035), ('processes', 0.035), ('estimate', 0.034), ('dt', 0.034), ('oo', 0.034), ('regression', 0.034), ('descent', 0.033), ('relation', 0.033), ('validation', 0.032), ('quantity', 0.032), ('ba', 0.032), ('extended', 0.032), ('physics', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 76 nips-2001-Fast Parameter Estimation Using Green's Functions

Author: K. Wong, F. Li

Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1

2 0.13527831 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

Author: Lehel Csató, Manfred Opper, Ole Winther

Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.

3 0.13368493 79 nips-2001-Gaussian Process Regression with Mismatched Models

Author: Peter Sollich

Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1

4 0.092786171 35 nips-2001-Analysis of Sparse Bayesian Learning

Author: Anita C. Faul, Michael E. Tipping

Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1

5 0.086750858 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

Author: Manfred Opper, Robert Urbanczik

Abstract: Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1

6 0.07285022 21 nips-2001-A Variational Approach to Learning Curves

7 0.071912259 183 nips-2001-The Infinite Hidden Markov Model

8 0.064044029 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

9 0.061775718 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

10 0.058271121 8 nips-2001-A General Greedy Approximation Algorithm with Applications

11 0.056834683 13 nips-2001-A Natural Policy Gradient

12 0.055724356 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

13 0.055721056 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

14 0.05290411 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

15 0.05120676 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

16 0.049910054 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

17 0.047411412 43 nips-2001-Bayesian time series classification

18 0.047143459 139 nips-2001-Online Learning with Kernels

19 0.044582736 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

20 0.04410759 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.171), (1, -0.039), (2, 0.002), (3, -0.016), (4, -0.012), (5, -0.046), (6, 0.076), (7, -0.024), (8, -0.001), (9, -0.01), (10, -0.026), (11, 0.072), (12, 0.01), (13, -0.094), (14, 0.04), (15, 0.092), (16, 0.017), (17, 0.02), (18, 0.103), (19, -0.014), (20, -0.01), (21, -0.117), (22, 0.051), (23, 0.119), (24, 0.13), (25, 0.092), (26, -0.002), (27, -0.051), (28, -0.018), (29, 0.215), (30, 0.066), (31, 0.044), (32, -0.028), (33, 0.038), (34, 0.024), (35, 0.079), (36, 0.03), (37, 0.036), (38, -0.015), (39, -0.136), (40, -0.072), (41, 0.194), (42, 0.027), (43, 0.133), (44, -0.008), (45, -0.014), (46, 0.059), (47, 0.061), (48, -0.028), (49, -0.067)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92921072 76 nips-2001-Fast Parameter Estimation Using Green's Functions

Author: K. Wong, F. Li

Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1

2 0.77066338 79 nips-2001-Gaussian Process Regression with Mismatched Models

Author: Peter Sollich

Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1

3 0.55481654 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

Author: Shun-ichi Amari, Hyeyoung Park, Tomoko Ozeki

Abstract: Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection theory such as AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estimator and the Bayesian predictive distribution in terms of Gaussian random fields, by using simple models. 1

4 0.55399495 21 nips-2001-A Variational Approach to Learning Curves

Author: Dörthe Malzahn, Manfred Opper

Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.

5 0.49549779 35 nips-2001-Analysis of Sparse Bayesian Learning

Author: Anita C. Faul, Michael E. Tipping

Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1

6 0.49423435 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

7 0.43942434 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

8 0.41960043 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

9 0.40488854 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

10 0.37352234 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

11 0.36534935 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network

12 0.35528231 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

13 0.35500351 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

14 0.34626213 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

15 0.34501106 169 nips-2001-Small-World Phenomena and the Dynamics of Information

16 0.33774662 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA

17 0.32937461 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

18 0.31694838 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

19 0.31388313 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

20 0.31305698 61 nips-2001-Distribution of Mutual Information


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.051), (17, 0.04), (19, 0.04), (27, 0.089), (30, 0.086), (36, 0.342), (38, 0.011), (59, 0.032), (72, 0.075), (79, 0.051), (83, 0.02), (91, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79955846 76 nips-2001-Fast Parameter Estimation Using Green's Functions

Author: K. Wong, F. Li

Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1

2 0.73498636 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

3 0.7004348 193 nips-2001-Unsupervised Learning of Human Motion Models

Author: Yang Song, Luis Goncalves, Pietro Perona

Abstract: This paper presents an unsupervised learning algorithm that can derive the probabilistic dependence structure of parts of an object (a moving human body in our examples) automatically from unlabeled data. The distinguished part of this work is that it is based on unlabeled data, i.e., the training features include both useful foreground parts and background clutter and the correspondence between the parts and detected features are unknown. We use decomposable triangulated graphs to depict the probabilistic independence of parts, but the unsupervised technique is not limited to this type of graph. In the new approach, labeling of the data (part assignments) is taken as hidden variables and the EM algorithm is applied. A greedy algorithm is developed to select parts and to search for the optimal structure based on the differential entropy of these variables. The success of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled real image sequences.

4 0.56566972 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

5 0.49652904 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

6 0.49525324 143 nips-2001-PAC Generalization Bounds for Co-training

7 0.48428828 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

8 0.48001739 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

9 0.47169837 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

10 0.47064775 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

11 0.46957952 31 nips-2001-Algorithmic Luckiness

12 0.46909815 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding

13 0.4652186 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

14 0.4635545 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

15 0.46126035 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

16 0.45979887 60 nips-2001-Discriminative Direction for Kernel Classifiers

17 0.45976263 56 nips-2001-Convolution Kernels for Natural Language

18 0.4595921 121 nips-2001-Model-Free Least-Squares Policy Iteration

19 0.4587034 62 nips-2001-Duality, Geometry, and Support Vector Regression

20 0.45866966 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning