nips nips2000 nips2000-97 knowledge-graph by maker-knowledge-mining

97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping


Source: pdf

Author: Rich Caruana, Steve Lawrence, C. Lee Giles

Abstract: The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity. 2) Regardless of size, nets learn task subcomponents in similar sequence. Big nets pass through stages similar to those learned by smaller nets. Early stopping can stop training the large net when it generalizes comparably to a smaller net. We also show that conjugate gradient can yield worse generalization because it overfits regions of low non-linearity when learning to fit regions of high non-linearity.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The conventional wisdom is that backprop nets with excess hidden units generalize poorly. [sent-9, score-1.115]

2 We show that nets with excess capacity generalize well when trained with backprop and early stopping. [sent-10, score-1.304]

3 Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. [sent-11, score-0.117]

4 Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity. [sent-12, score-0.845]

5 2) Regardless of size, nets learn task subcomponents in similar sequence. [sent-13, score-0.707]

6 Big nets pass through stages similar to those learned by smaller nets. [sent-14, score-0.854]

7 Early stopping can stop training the large net when it generalizes comparably to a smaller net. [sent-15, score-0.628]

8 We also show that conjugate gradient can yield worse generalization because it overfits regions of low non-linearity when learning to fit regions of high non-linearity. [sent-16, score-0.412]

9 1 Introduction: It is commonly believed that large multi-layer perceptrons (MLPs) generalize poorly: nets with too much capacity overfit the training data. [sent-17, score-1.016]

10 Restricting net capacity prevents overfitting because the net has insufficient capacity to learn models that are too complex. [sent-18, score-1.336]

11 This belief is consistent with a VC-dimension analysis of net capacity vs. [sent-19, score-0.482]

12 generalization: the more free parameters in the net the larger the VC-dimension of the hypothesis space, and the less likely the training sample is large enough to select a (nearly) correct hypothesis [2]. [sent-20, score-0.38]

13 Once it became feasible to train large nets on real problems, a number of MLP users noted that the overfitting they expected from nets with excess capacity did not occur. [sent-21, score-1.872]

14 Large nets appeared to generalize as well as smaller nets - sometimes better. [sent-22, score-1.382]

15 The earliest report of this that we are aware of is Martin and Pittman in 1991: "We find only marginal and inconsistent indications that constraining net capacity improves generalization" [7]. [sent-23, score-0.482]

16 We present empirical results showing that MLPs with excess capacity often do not overfit. [sent-24, score-0.351]

17 On the contrary, we observe that large nets often generalize better than small nets of sufficient capacity. [sent-25, score-1.368]

18 Backprop appears to use excess capacity to better fit regions of high non-linearity, while still fitting regions of low non-linearity smoothly. [sent-26, score-0.588]

19 (This desirable behavior can disappear if a fast training algorithm such as conjugate gradient is used instead of backprop.) [sent-27, score-0.104]

20 Nets with excess capacity trained with backprop appear first to learn models similar to models learned by smaller nets. [sent-28, score-0.885]

21 If early stopping is used, training of the large net can be halted when the large net's model is similar to models learned by smaller nets. [sent-29, score-0.869]

22 The large MLP does not overfit significantly more than the small MLP. [sent-38, score-0.159]

23 2 Overfitting: Much has been written about overfitting and the bias/variance tradeoff in neural nets and other machine learning models [2, 12, 4, 8, 5, 13, 6]. [sent-39, score-0.939]

24 We fit polynomial models with orders 2-20 to the data. [sent-47, score-0.162]

25 As the order (and number of parameters) increases, however, significant overfitting (poor generalization) occurs. [sent-50, score-0.287]

26 At order 20, the polynomial fits the training data well, but interpolates poorly. [sent-51, score-0.143]
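
To make the polynomial picture concrete, here is a small illustrative sketch (synthetic data, not the paper's; the target function, noise level, and number of samples are assumptions): low-order fits underfit, while a high-order fit drives training error down but interpolates poorly between the training points.

```python
# Sketch only: fit polynomials of increasing order to noisy samples of a smooth target.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(x_train.size)
x_dense = np.linspace(0.0, 1.0, 200)                 # points between the training samples

for order in (2, 10, 20):
    p = Polynomial.fit(x_train, y_train, deg=order)  # least-squares polynomial fit
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    interp_mse = np.mean((p(x_dense) - np.sin(2 * np.pi * x_dense)) ** 2)
    print(f"order {order:2d}: train MSE {train_mse:.4f}, interpolation MSE {interp_mse:.4f}")
```

Training error falls steadily with the order, but the interpolation error against the underlying target typically rises again for the highest orders, which is the overfitting pattern described above.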

27 We used a single hidden layer MLP, backpropagation (BP), and 100,000 stochastic updates. [sent-53, score-0.131]

28 As with polynomials, the smallest net with one hidden unit (HU) (4 weights) underfits the data. [sent-58, score-0.469]

29 MLPs with seven times as many parameters as data points trained with BP do not significantly overfit this data. [sent-61, score-0.251]
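
As a rough sketch of the kind of network described here (not the authors' code; the data, hidden-layer size, and learning rate are assumptions), the following trains a one-hidden-layer MLP with plain stochastic backprop updates from small initial weights:

```python
# Sketch only: one-hidden-layer MLP, squared-error loss, one pattern per update.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)            # far fewer points than weights
y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(X.shape)

n_hidden = 100
W1 = 0.1 * rng.standard_normal((1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, 1)); b2 = np.zeros(1)
lr = 0.01

for step in range(100_000):                              # stochastic updates
    i = rng.integers(len(X))
    x, t = X[i:i + 1], y[i:i + 1]
    h = np.tanh(x @ W1 + b1)                             # forward pass
    out = h @ W2 + b2
    err = out - t                                        # backward pass
    gW2 = h.T @ err
    gb2 = err.sum(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = x.T @ dh
    gb1 = dh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# Inspect the interpolant on a dense grid.
grid = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
pred = np.tanh(grid @ W1 + b1) @ W2 + b2
```

With small initial weights and plain stochastic updates, the fitted curve typically stays smooth between the samples even though the net has many more parameters than data points, consistent with the behavior reported above.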

30 3 Local Overfitting: Regularization methods such as weight decay typically assume that overfitting is a global phenomenon. [sent-63, score-0.264]

31 But overfitting can vary significantly in different regions of a model. [sent-64, score-0.381]

32 In Figure 2 the order 6 model fits the left region well. [sent-67, score-0.091]

33 The overfitting behavior differs in the left and right hand regions. [sent-93, score-0.264]

34 The order 6 model underfits the region on the right, and the order 10 model fits it better. [sent-95, score-0.147]

35 Figure 3 shows MLPs trained on the same data (20,000 batch updates, learning rate linearly reduced to zero starting at 0. [sent-97, score-0.082]

36 Larger nets, however, fit the entire function well without significant overfitting in the left region. [sent-100, score-0.328]

37 The ability of MLPs to fit both regions of low and high non-linearity well (without overfitting) depends on the training algorithm. [sent-101, score-0.204]

38 CG results in lower training error for this problem, but overfits significantly. [sent-103, score-0.098]

39 Large BP nets generalize better on this problem -- even the optimal size CG net is prone to overfitting. [sent-105, score-1.067]

40 When the net is large enough to fit the region of high non-linearity, overfitting is often seen in the region of low non-linearity. [sent-107, score-0.785]
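
For contrast with the stochastic backprop loop above, here is a rough sketch of full-batch conjugate gradient training of a small MLP (via scipy.optimize with a numerically estimated gradient; the data and architecture are the same assumptions as before, not the paper's setup):

```python
# Sketch only: fit a tiny one-hidden-layer MLP by full-batch conjugate gradient.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(X.shape)
n_hidden = 10                                           # 31 parameters in total

def unpack(theta):
    W1 = theta[:n_hidden].reshape(1, n_hidden)
    b1 = theta[n_hidden:2 * n_hidden]
    W2 = theta[2 * n_hidden:3 * n_hidden].reshape(n_hidden, 1)
    b2 = theta[3 * n_hidden:]
    return W1, b1, W2, b2

def training_mse(theta):
    W1, b1, W2, b2 = unpack(theta)
    out = np.tanh(X @ W1 + b1) @ W2 + b2
    return float(np.mean((out - y) ** 2))

theta0 = 0.1 * rng.standard_normal(3 * n_hidden + 1)
result = minimize(training_mse, theta0, method="CG")    # gradient by finite differences
print("CG training MSE:", result.fun)
```

The point of the contrast is only the optimizer: full-batch CG tends to reach a lower training error than the stochastic backprop loop, which is exactly the regime where the text reports overfitting of the smooth region.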

41 4 Generalization, Network Capacity, and Early Stopping: The results in Sections 2 and 3 suggest that BP nets are less prone to overfitting than expected. [sent-108, score-0.915]

42 1 Results: For each problem we used small training sets (100-1000 points, depending on the problem) so that overfitting was possible. [sent-120, score-0.303]

43 We trained fully connected feedforward MLPs with one hidden layer whose size varied from 2 to 800 HU (about 500-100,000 parameters). [sent-121, score-0.207]

44 All the nets were trained with BP using stochastic updates, learning rate 0. [sent-122, score-0.704]

45 We used early stopping for regularization because it doesn't interfere with backprop's ability to control capacity locally. [sent-125, score-0.427]

46 Early stopping combined with backprop is so effective that very large nets can be trained without significant overfitting. [sent-126, score-1.021]
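
A generic early-stopping skeleton in the spirit of this procedure (a sketch: `train_epoch` and `val_error` are hypothetical callables supplied by the caller, not functions from the paper):

```python
# Sketch only: keep the weights from the epoch with the lowest validation error
# and halt once validation error has not improved for `patience` epochs.
import copy

def train_with_early_stopping(model, train_epoch, val_error,
                              max_epochs=1000, patience=20):
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_epoch(model)                 # one pass of backprop updates, in place
        err = val_error(model)             # error on a held-out validation set
        if err < best_err:
            best_err = err
            best_model = copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                      # the net is past its generalization peak
    return best_model, best_err
```

Returning the saved best model rather than the final one is what lets a very large net be trained past its peak without paying for the extra capacity.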

47 Figure 3 (panels: 1, 4, 10, and 100 hidden units): MLP approximation using backpropagation (BP) training of data from Equation 1 as the number of hidden units is increased. [sent-129, score-0.24]

48 Examining the results for all seven problems, we observe that on only three (Base 1, Base 2, and ALVINN) do nets that are too large yield worse generalization than smaller networks, and even there the loss is surprisingly small. [sent-135, score-0.868]

49 Many trials were required before statistical tests confirmed that the differences between the optimal size net and the largest net were significant. [sent-136, score-0.653]

50 Moreover, the results suggest that generalization is hurt more by using a net that is a little too small than by using one that is far too large, i.e. [sent-137, score-0.414]

51 it is better to make nets too large than too small. [sent-139, score-0.678]

52 For most tasks and net sizes, we trained well beyond the point where generalization performance peaked. [sent-140, score-0.47]

53 Regularization (e.g., early stopping) is critical for nets of all sizes, not just ones that are too big. [sent-146, score-0.747]

54 2 Why Excess Capacity Does Not Hurt: BP nets initialized with small weights can develop large weights only after the number of updates is large. [sent-149, score-0.764]

55 Thus BP nets consider hypotheses with small weights before hypotheses with large weights. [sent-150, score-0.741]

56 Nets with large weights have more representational power, so simple hypotheses are explored before complex hypotheses. [sent-151, score-0.096]

59 We analyzed what nets of different size learn while they are trained. [sent-231, score-0.714]

60 We compared the input/output behavior of nets at different stages of learning on large samples of test patterns. [sent-232, score-0.683]

61 We compare the input/output behavior of two nets by computing the squared error between the predictions made by the two nets. [sent-233, score-0.695]

62 If two nets make the same predictions for all test cases, they have learned the same model (even though each model is represented differently), and the squared error between the two models is zero. [sent-234, score-0.809]

63 If two nets make different predictions for test cases, they have learned different models, and the squared error between them is large. [sent-235, score-0.756]

64 This is not the error the models make predicting the true labels, but the difference between predictions made by two different models. [sent-236, score-0.102]

65 Two models can have poor generalization (large error on true labels), but have near zero error compared to each other if they are similar models. [sent-237, score-0.215]

66 But two models with good generalization (low error on true labels) must have low error compared to each other. [sent-238, score-0.238]
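
The comparison described above can be written as a one-line measure (a sketch; `predict_a` and `predict_b` are assumed to be the two nets' prediction functions, not names from the paper):

```python
# Sketch only: mean squared difference between two models' predictions on a
# common set of test patterns -- a model-to-model distance, not error on labels.
import numpy as np

def model_distance(predict_a, predict_b, test_patterns):
    return float(np.mean((predict_a(test_patterns) - predict_b(test_patterns)) ** 2))
```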

67 The first graph in Figure 5 shows learning curves for nets with 10, 25, 50, 100, 200, and 400 HU trained on NETtalk. [sent-239, score-0.704]

68 For each size, we saved the net from the epoch that generalized best on a large test set. [sent-240, score-0.414]

69 We then trained a BP net with 800 HU, and after each epoch compared this net's model with the best models saved for nets of 10-400 HU. [sent-242, score-1.138]

70 This lets us compare the sequence of models learned by the 800 HU net to the best models learned by smaller nets. [sent-243, score-0.65]
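
A sketch of that comparison loop (hypothetical interfaces: the large net is assumed to expose a `predict` method, `train_epoch` runs one epoch of backprop on it, and `best_small_predicts` maps each smaller net's hidden-unit count to its saved best model's prediction function):

```python
# Sketch only: after each epoch of the large net, record its distance to the
# saved best-generalizing model of each smaller net (curves as in Figure 6).
import numpy as np

def track_similarity(big_net, best_small_predicts, test_patterns,
                     train_epoch, n_epochs):
    history = {size: [] for size in best_small_predicts}
    for _ in range(n_epochs):
        train_epoch(big_net)
        big_out = big_net.predict(test_patterns)
        for size, small_predict in best_small_predicts.items():
            dist = float(np.mean((big_out - small_predict(test_patterns)) ** 2))
            history[size].append(dist)
    return history
```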

71 The horizontal axis is the number of backprop passes applied to the 800 HU net. [sent-245, score-0.13]

72 The vertical axis is the error between the 800 HU net model and the best model for each smaller net. [sent-246, score-0.448]

73 The 800 HU net starts off distant from the good smaller models, then becomes similar to the good models, and then diverges from them. [sent-247, score-0.47]

74 What is interesting is that the 800 HU net first becomes closest to the best models of the smallest nets, then to those of progressively larger nets. [Figure title: similarity of the 800 HU net during training to the smaller-size peak performers.] [sent-249, score-0.406]

75 [Figure legend residue: 10, 25, 50, 100, 200, and 400 HU nets.] [sent-253, score-0.822]

76 Figure 6: I/O similarity during training between an 800 hidden unit net and smaller nets (10, 25, 50, 100, 200, and 400 hidden units) trained on NETtalk; the horizontal axis is the number of pattern presentations. [sent-268, score-0.24]

77 As it is trained, the 800 HU net learns a sequence of models similar to the models learned by smaller nets. [sent-270, score-0.595]

78 If early stopping is used, training of the 800 HU net can be stopped when it behaves similarly to the best model that could be learned with nets of 10, 25, 50, ... hidden units. [sent-271, score-1.306]

79 Large BP nets learn models similar to those learned by smaller nets. [sent-275, score-0.911]

80 If a BP net with too much capacity would overfit, early stopping could stop training when the model was similar to a model that would have been learned by a smaller net of optimal size. [sent-276, score-1.259]

81 The error between models is about 200-400, yet the generalization error is about 1600. [sent-277, score-0.185]

82 With early stopping, what counts is the closest approach of each model to the target function, not where models end up late in training. [sent-279, score-0.185]

83 With early stopping there is little disadvantage to using models that are too large because their learning trajectories are similar to those followed by smaller nets of more optimal size. [sent-280, score-1.05]

84 5 Related Work: Our results show that models learned by backprop are biased towards "smooth" solutions. [sent-281, score-0.265]

85 As nets with excess capacity are trained, they first explore smoother models similar to the models smaller nets would have learned. [sent-282, score-1.821]

86 Weigend [11] performed an experiment that showed BP nets learn a problem's eigenvectors in sequence, learning the 1st eigenvector first, then the 2nd, etc. [sent-283, score-0.697]

87 Bartlett notes: "the VC-bounds seem loose; neural nets often perform successfully with training sets that are considerably smaller than the number of weights." [sent-286, score-0.771]

88 This result suggests that a net with smaller weights will generalize better than a similar net with large weights. [sent-288, score-0.88]

89 Examining the weights from BP and CG nets shows that BP training typically results in smaller weights. [sent-289, score-0.791]

90 6 Summary: Nets of all sizes overfit some problems. [sent-290, score-0.134]

91 But generalization is surprisingly insensitive to excess capacity if the net is trained with backprop. [sent-291, score-0.801]

92 Because BP nets with excess capacity learn a sequence of models functionally similar to what smaller nets learn, early stopping can often be used to stop training large nets when they have learned models similar to those learned by smaller nets of optimal size. [sent-292, score-3.683]

93 This means there is little loss in generalization performance for nets with excess capacity if early stopping can be used. [sent-293, score-1.255]

94 Overfitting can vary significantly in different regions of the model. [sent-295, score-0.117]

95 MLPs trained with BP use excess parameters to improve fit in regions of high non-linearity, while not significantly overfitting other regions. [sent-296, score-0.662]

96 Nets trained with conjugate gradient, however, are more sensitive to net size. [sent-297, score-0.432]

97 BP nets appear to be better than CG nets at avoiding overfitting in regions with different degrees of non-linearity, perhaps because CG is more effective at learning more complex functions that overfit training data, while BP is biased toward learning smoother functions. [sent-298, score-1.803]

98 For valid generalization the size of the weights is more important than the size of the network. [sent-301, score-0.194]

99 The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. [sent-345, score-0.134]

100 On overfitting and the effective number of hidden units. [sent-361, score-0.375]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nets', 0.622), ('net', 0.308), ('hu', 0.274), ('overfitting', 0.264), ('bp', 0.227), ('mlps', 0.187), ('capacity', 0.174), ('excess', 0.157), ('stopping', 0.131), ('backprop', 0.13), ('overfit', 0.1), ('early', 0.091), ('smaller', 0.09), ('hidden', 0.088), ('trained', 0.082), ('generalization', 0.08), ('cg', 0.079), ('units', 0.07), ('regions', 0.069), ('tio', 0.067), ('fit', 0.064), ('learned', 0.061), ('learn', 0.055), ('models', 0.053), ('mlp', 0.052), ('fu', 0.052), ('functi', 0.05), ('nettalk', 0.05), ('generalize', 0.048), ('base', 0.047), ('polynomial', 0.045), ('backpropagation', 0.043), ('seven', 0.043), ('conjugate', 0.042), ('closest', 0.041), ('weights', 0.04), ('training', 0.039), ('size', 0.037), ('presentations', 0.036), ('fits', 0.036), ('morgan', 0.036), ('volume', 0.035), ('sizes', 0.034), ('alvinn', 0.033), ('arget', 0.033), ('caruana', 0.033), ('eak', 0.033), ('ithou', 0.033), ('oroer', 0.033), ('overfits', 0.033), ('raining', 0.033), ('schedules', 0.033), ('underfits', 0.033), ('weigend', 0.033), ('large', 0.033), ('lawrence', 0.032), ('region', 0.032), ('low', 0.032), ('regularization', 0.031), ('kaufmann', 0.03), ('similar', 0.03), ('oj', 0.029), ('updates', 0.029), ('prone', 0.029), ('rain', 0.029), ('denver', 0.029), ('pea', 0.028), ('stages', 0.028), ('stop', 0.027), ('pages', 0.026), ('error', 0.026), ('saved', 0.026), ('hurt', 0.026), ('significantly', 0.026), ('labels', 0.025), ('advances', 0.024), ('best', 0.024), ('squared', 0.024), ('predictions', 0.023), ('effective', 0.023), ('better', 0.023), ('order', 0.023), ('hypotheses', 0.023), ('gradient', 0.023), ('giles', 0.023), ('epoch', 0.023), ('pass', 0.023), ('examining', 0.023), ('vary', 0.022), ('jj', 0.021), ('ng', 0.021), ('biased', 0.021), ('good', 0.021), ('polynomials', 0.02), ('often', 0.02), ('nodes', 0.02), ('smoother', 0.02), ('eigenvector', 0.02), ('autonomous', 0.02), ('robot', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping

Author: Rich Caruana, Steve Lawrence, C. Lee Giles

Abstract: The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity. 2) Regardless of size, nets learn task subcomponents in similar sequence. Big nets pass through stages similar to those learned by smaller nets. Early stopping can stop training the large net when it generalizes comparably to a smaller net. We also show that conjugate gradient can yield worse generalization because it overfits regions of low non-linearity when learning to fit regions of high non-linearity.

2 0.11426388 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

Author: Guy Mayraz, Geoffrey E. Hinton

Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n - l)th level model. After training, each level produces a separate, unnormalized log probabilty score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing. (1) where d is a data vector in a discrete space, Om is all the parameters of individual model m, Pm(dIOm) is the probability of d under model m, and c is an index over all possible vectors in the data space. A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, Sj , are conditionally independent given the visible states, Si, and the distribution of Sj is given by the standard logistic function : 1 p(Sj = 1) = (2) 1 + exp( - Li WijSi) Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. 
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, QOIIQoo, between the data distribution, QO, and the equilibrium distribution of fantasies over the visible units, Qoo, produced by the RBM [4]: flWij oc QO - Q~ (3) where < SiSj >Qo is the expected value of SiSj when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and Q ~ is the expected value of SiSj after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of Q ~ can swamp the gradient. [1] shows that it is far more effective to minimize the difference between QOllQoo and Q111Qoo where Q1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this

3 0.11000451 62 nips-2000-Generalized Belief Propagation

Author: Jonathan S. Yedidia, William T. Freeman, Yair Weiss

Abstract: Belief propagation (BP) was only supposed to work for tree-like networks but works surprisingly well in many applications involving networks with loops, including turbo codes. However, there has been little understanding of the algorithm or the nature of the solutions it finds for general graphs. We show that BP can only converge to a stationary point of an approximate free energy, known as the Bethe free energy in statistical physics. This result characterizes BP fixed-points and makes connections with variational approaches to approximate inference. More importantly, our analysis lets us build on the progress made in statistical physics since Bethe's approximation was introduced in 1935. Kikuchi and others have shown how to construct more accurate free energy approximations, of which Bethe's approximation is the simplest. Exploiting the insights from our analysis, we derive generalized belief propagation (GBP) versions ofthese Kikuchi approximations. These new message passing algorithms can be significantly more accurate than ordinary BP, at an adjustable increase in complexity. We illustrate such a new GBP algorithm on a grid Markov network and show that it gives much more accurate marginal probabilities than those found using ordinary BP. 1

4 0.068498604 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

Author: Gal Elidan, Noam Lotner, Nir Friedman, Daphne Koller

Abstract: A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. As such, they induce seemingly complex dependencies among the latter. In recent years, much attention has been devoted to the development of algorithms for learning parameters, and in some cases structure, in the presence of hidden variables. In this paper, we address the related problem of detecting hidden variables that interact with the observed variables. This problem is of interest both for improving our understanding of the domain and as a preliminary step that guides the learning procedure towards promising models. A very natural approach is to search for

5 0.062186703 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

Author: Yee Whye Teh, Geoffrey E. Hinton

Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.

6 0.060119543 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

7 0.058371071 6 nips-2000-A Neural Probabilistic Language Model

8 0.047847077 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

9 0.044956602 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

10 0.044586629 92 nips-2000-Occam's Razor

11 0.044021185 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

12 0.044006925 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

13 0.043442525 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

14 0.04339892 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

15 0.042758118 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

16 0.042451903 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

17 0.041187786 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

18 0.040365625 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

19 0.038385719 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

20 0.037752401 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.137), (1, -0.015), (2, 0.054), (3, -0.035), (4, 0.058), (5, -0.011), (6, 0.035), (7, -0.016), (8, 0.054), (9, 0.062), (10, 0.017), (11, -0.087), (12, 0.101), (13, -0.032), (14, 0.118), (15, 0.09), (16, -0.003), (17, 0.085), (18, 0.074), (19, -0.12), (20, 0.126), (21, 0.081), (22, -0.152), (23, -0.012), (24, 0.024), (25, -0.075), (26, -0.07), (27, 0.053), (28, -0.156), (29, -0.049), (30, -0.145), (31, -0.05), (32, -0.115), (33, -0.244), (34, -0.075), (35, 0.048), (36, -0.08), (37, -0.019), (38, 0.087), (39, -0.053), (40, -0.183), (41, -0.227), (42, -0.013), (43, -0.037), (44, -0.172), (45, 0.041), (46, -0.099), (47, -0.055), (48, 0.131), (49, -0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95719606 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping

Author: Rich Caruana, Steve Lawrence, C. Lee Giles

Abstract: The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity. 2) Regardless of size, nets learn task subcomponents in similar sequence. Big nets pass through stages similar to those learned by smaller nets. Early stopping can stop training the large net when it generalizes comparably to a smaller net. We also show that conjugate gradient can yield worse generalization because it overfits regions of low non-linearity when learning to fit regions of high non-linearity.

2 0.57815278 62 nips-2000-Generalized Belief Propagation

Author: Jonathan S. Yedidia, William T. Freeman, Yair Weiss

Abstract: Belief propagation (BP) was only supposed to work for tree-like networks but works surprisingly well in many applications involving networks with loops, including turbo codes. However, there has been little understanding of the algorithm or the nature of the solutions it finds for general graphs. We show that BP can only converge to a stationary point of an approximate free energy, known as the Bethe free energy in statistical physics. This result characterizes BP fixed-points and makes connections with variational approaches to approximate inference. More importantly, our analysis lets us build on the progress made in statistical physics since Bethe's approximation was introduced in 1935. Kikuchi and others have shown how to construct more accurate free energy approximations, of which Bethe's approximation is the simplest. Exploiting the insights from our analysis, we derive generalized belief propagation (GBP) versions ofthese Kikuchi approximations. These new message passing algorithms can be significantly more accurate than ordinary BP, at an adjustable increase in complexity. We illustrate such a new GBP algorithm on a grid Markov network and show that it gives much more accurate marginal probabilities than those found using ordinary BP. 1

3 0.4039861 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

Author: Eiji Mizutani, James Demmel

Abstract: This paper describes a method of dogleg trust-region steps, or restricted Levenberg-Marquardt steps, based on a projection process onto the Krylov subspaces for neural networks nonlinear least squares problems. In particular, the linear conjugate gradient (CG) method works as the inner iterative algorithm for solving the linearized Gauss-Newton normal equation, whereas the outer nonlinear algorithm repeatedly takes so-called

4 0.35053408 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

Author: Guy Mayraz, Geoffrey E. Hinton

Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n - l)th level model. After training, each level produces a separate, unnormalized log probabilty score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing. (1) where d is a data vector in a discrete space, Om is all the parameters of individual model m, Pm(dIOm) is the probability of d under model m, and c is an index over all possible vectors in the data space. A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, Sj , are conditionally independent given the visible states, Si, and the distribution of Sj is given by the standard logistic function : 1 p(Sj = 1) = (2) 1 + exp( - Li WijSi) Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. 
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, QOIIQoo, between the data distribution, QO, and the equilibrium distribution of fantasies over the visible units, Qoo, produced by the RBM [4]: flWij oc QO - Q~ (3) where < SiSj >Qo is the expected value of SiSj when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and Q ~ is the expected value of SiSj after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of Q ~ can swamp the gradient. [1] shows that it is far more effective to minimize the difference between QOllQoo and Q111Qoo where Q1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this

5 0.32537401 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

Author: Yee Whye Teh, Geoffrey E. Hinton

Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.

6 0.29601249 143 nips-2000-Using the Nyström Method to Speed Up Kernel Machines

7 0.26198572 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

8 0.25771496 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

9 0.24031496 6 nips-2000-A Neural Probabilistic Language Model

10 0.23743461 92 nips-2000-Occam's Razor

11 0.22343215 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

12 0.2147651 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

13 0.21410933 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

14 0.21030279 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

15 0.19775243 54 nips-2000-Feature Selection for SVMs

16 0.19094746 34 nips-2000-Competition and Arbors in Ocular Dominance

17 0.1890696 52 nips-2000-Fast Training of Support Vector Classifiers

18 0.18482989 103 nips-2000-Probabilistic Semantic Video Indexing

19 0.18175386 131 nips-2000-The Early Word Catches the Weights

20 0.18149792 125 nips-2000-Stability and Noise in Biochemical Switches


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.021), (17, 0.078), (26, 0.01), (32, 0.019), (33, 0.054), (36, 0.013), (55, 0.047), (62, 0.037), (65, 0.044), (66, 0.35), (67, 0.055), (76, 0.048), (79, 0.022), (81, 0.032), (90, 0.034), (91, 0.013), (97, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84574455 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping

Author: Rich Caruana, Steve Lawrence, C. Lee Giles

Abstract: The conventional wisdom is that backprop nets with excess hidden units generalize poorly. We show that nets with excess capacity generalize well when trained with backprop and early stopping. Experiments suggest two reasons for this: 1) Overfitting can vary significantly in different regions of the model. Excess capacity allows better fit to regions of high non-linearity, and backprop often avoids overfitting the regions of low non-linearity. 2) Regardless of size, nets learn task subcomponents in similar sequence. Big nets pass through stages similar to those learned by smaller nets. Early stopping can stop training the large net when it generalizes comparably to a smaller net. We also show that conjugate gradient can yield worse generalization because it overfits regions of low non-linearity when learning to fit regions of high non-linearity.

2 0.36516818 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

3 0.3612743 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

Author: Thomas Natschläger, Wolfgang Maass, Eduardo D. Sontag, Anthony M. Zador

Abstract: Experimental data show that biological synapses behave quite differently from the symbolic synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their

4 0.34885815 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

5 0.3488535 146 nips-2000-What Can a Single Neuron Compute?

Author: Blaise Agüera y Arcas, Adrienne L. Fairhall, William Bialek

Abstract: In this paper we formulate a description of the computation performed by a neuron as a combination of dimensional reduction and nonlinearity. We implement this description for the HodgkinHuxley model, identify the most relevant dimensions and find the nonlinearity. A two dimensional description already captures a significant fraction of the information that spikes carry about dynamic inputs. This description also shows that computation in the Hodgkin-Huxley model is more complex than a simple integrateand-fire or perceptron model. 1

6 0.346715 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

7 0.3448877 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.34381112 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

9 0.34355518 13 nips-2000-A Tighter Bound for Graphical Models

10 0.34163931 122 nips-2000-Sparse Representation for Gaussian Process Models

11 0.34130511 37 nips-2000-Convergence of Large Margin Separable Linear Classification

12 0.33797044 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

13 0.3373771 49 nips-2000-Explaining Away in Weight Space

14 0.33693042 111 nips-2000-Regularized Winnow Methods

15 0.3360379 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

16 0.33512807 52 nips-2000-Fast Training of Support Vector Classifiers

17 0.33456132 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

18 0.33452561 60 nips-2000-Gaussianization

19 0.33328968 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

20 0.33292717 127 nips-2000-Structure Learning in Human Causal Induction