nips nips2001 nips2001-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Sollich
Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). [sent-6, score-0.177]
2 In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. [sent-8, score-1.229]
3 Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. [sent-9, score-1.24]
4 1 Introduction There has in the last few years been a good deal of excitement about the use of Gaussian processes (GPs) as an alternative to feedforward networks [1]. [sent-11, score-0.088]
5 GPs make prior assumptions about the problem to be learned very transparent, and even though they are non-parametric models, inference (at least in the case of regression considered below) is relatively straightforward. [sent-12, score-0.182]
6 One crucial question is then how many training examples are needed to achieve a certain level of generalization performance. [sent-15, score-0.154]
7 The typical (as opposed to worst case) behaviour is captured in the learning curve, which gives the average generalization error ε as a function of the number of training examples n. [sent-16, score-0.302]
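For orientation, a hedged sketch of this quantity (a standard definition; the paper's exact conventions, e.g. whether the noise variance is included, may differ): writing θ̂ for the student's posterior mean prediction,

$$\epsilon(n) \;=\; \Big\langle \big\langle \big(\hat\theta(x) - \theta_*(x)\big)^2 \big\rangle_x \Big\rangle_{D},$$

i.e. the squared error averaged over test inputs x and over training sets D of size n.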
8 Good bounds and approximations for ε(n) are now available [1, 2, 3, 4, 5], but these are mostly restricted to the case where the 'student' model exactly matches the true 'teacher' generating the data¹. [sent-17, score-0.143]
9 In practice, such a match is unlikely, and so it is important to understand how GPs learn if there is some model mismatch. (Footnote 1: The exception is the elegant work of Malzahn and Opper [2], which uses a statistical physics framework to derive approximate learning curves that also apply for any fixed target function. [sent-18, score-0.186]
10 However, this framework has not yet to my knowledge been exploited to ...) [sent-19, score-0.084]
11 In its simplest form, the regression problem is this: We are trying to learn a function θ* which maps inputs x (real-valued vectors) to (real-valued scalar) outputs θ*(x). [sent-21, score-0.322]
12 We are given a set of training data D, consisting of n input-output pairs (x^l, y^l); the training outputs y^l may differ from the 'clean' teacher outputs θ*(x^l) due to corruption by noise. [sent-22, score-1.057]
13 Given a test input x, we are then asked to come up with a prediction θ̂(x), plus error bar, for the corresponding output θ*(x). [sent-23, score-0.136]
14 In a Bayesian setting, we do this by specifying a prior P(θ) over hypothesis functions, and a likelihood P(D|θ) with which each θ could have generated the training data; from this we deduce the posterior distribution P(θ|D) ∝ P(D|θ)P(θ). [sent-24, score-0.384]
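Spelled out for the Gaussian noise model the student uses below (a standard sketch, not quoted from the paper; σ² is the student's assumed noise variance, introduced formally further down):

$$P(\theta|D) \;\propto\; P(D|\theta)\,P(\theta), \qquad P(D|\theta) \;\propto\; \prod_{l=1}^{n} \exp\!\left(-\frac{\big(y^l - \theta(x^l)\big)^2}{2\sigma^2}\right).$$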
15 For a GP, the prior is defined directly over input-output functions θ; this is simpler than for a Bayesian feedforward net since no weights are involved which would have to be integrated out. [sent-25, score-0.209]
16 Any θ is uniquely determined by its output values θ(x) for all x from the input domain, and for a GP, these are assumed to have a joint Gaussian distribution (hence the name). [sent-26, score-0.133]
17 If we set the means to zero as is commonly done, this distribution is fully specified by the covariance function ⟨θ(x)θ(x')⟩ = C(x, x'). [sent-27, score-0.087]
18 The latter transparently encodes prior assumptions about the function to be learned. [sent-28, score-0.171]
19 Here l is a lengthscale parameter, corresponding directly to the distance in input space over which we expect significant variation in the function values. [sent-30, score-0.057]
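A minimal Python sketch of this covariance and of what the zero-mean joint-Gaussian prior means in practice (the function and variable names, the scalar-input restriction and the lengthscale value are my own illustrative choices, not from the paper):

```python
import numpy as np

def rbf_cov(x, xp, lengthscale):
    """Squared-exponential (RBF) covariance C(x, x') = exp(-(x - x')^2 / (2 l^2))
    for scalar inputs; l sets the input distance over which function values vary."""
    return np.exp(-(x[:, None] - xp[None, :]) ** 2 / (2.0 * lengthscale ** 2))

# GP prior: any finite set of function values theta(x_1), ..., theta(x_m) is
# jointly Gaussian with zero mean and covariance matrix C evaluated at those inputs.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
C = rbf_cov(x, x, lengthscale=0.3) + 1e-10 * np.eye(len(x))  # jitter for stability
prior_samples = rng.multivariate_normal(np.zeros(len(x)), C, size=3)
```

Shorter lengthscales produce rougher sample functions; prior assumptions of this kind are exactly what can be mismatched between student and teacher.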
20 There are good reviews on how inference with GPs works [1, 6], so I only give a brief summary here. [sent-31, score-0.036]
21 The student assumes that outputs y are generated from the 'clean' values of a hypothesis function θ(x) by adding Gaussian noise of x-independent variance σ². [sent-32, score-0.644]
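With that noise model the student's posterior over θ(x) is again Gaussian and available in closed form; a minimal sketch assuming a zero prior mean (the function name and the generic `cov` argument are my own conventions):

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, cov, noise_var):
    """Student's GP posterior at the test inputs.

    Posterior mean:     k(x)^T (K + sigma^2 I)^{-1} y
    Posterior variance: C(x, x) - k(x)^T (K + sigma^2 I)^{-1} k(x)
    where K is the covariance matrix of the training inputs and k(x) the vector
    of covariances between the training inputs and the test input x.
    """
    K = cov(x_train, x_train) + noise_var * np.eye(len(x_train))
    k = cov(x_train, x_test)                      # shape (n_train, n_test)
    mean = k.T @ np.linalg.solve(K, y_train)
    var = np.diag(cov(x_test, x_test)) - np.sum(k * np.linalg.solve(K, k), axis=0)
    return mean, var
```

The posterior mean is the student's prediction θ̂(x) and the posterior variance its error bar.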
22 where in the last expression I have replaced the average over D by one over the training inputs since the outputs no longer appear. [sent-36, score-0.364]
23 If the student model matches the true teacher model, ε and ε̂ coincide and give the Bayes error, i.e. [sent-37, score-1.022]
24 the best achievable (average) generalization performance for the given teacher. [sent-39, score-0.081]
25 I assume in what follows that the teacher is also a GP, but with a possibly different covariance function C*(x, x') and noise level σ*². [sent-40, score-0.597]
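To make the mismatched setup concrete, here is an illustrative simulation sketch of the kind used to check the theory (all names, lengthscales and noise levels are my own choices, not the paper's values): the teacher function is drawn from a GP with covariance C* and its outputs corrupted with variance σ*², while the student predicts with its own covariance C and assumed noise σ²; averaging the squared error of the posterior mean over data sets gives a crude empirical learning curve.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, l):
    # Squared-exponential covariance for scalar inputs.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * l ** 2))

def one_run(n, n_test=200, l_teacher=0.3, l_student=1.0,
            noise_teacher=0.05, noise_student=0.05):
    """One teacher draw and data set; returns the student's squared test error."""
    x_train = rng.uniform(-1.0, 1.0, n)
    x_test = rng.uniform(-1.0, 1.0, n_test)
    x_all = np.concatenate([x_train, x_test])
    # Teacher function sampled jointly at training and test inputs from C*.
    C_star = rbf(x_all, x_all, l_teacher) + 1e-10 * np.eye(len(x_all))
    theta_star = rng.multivariate_normal(np.zeros(len(x_all)), C_star)
    y = theta_star[:n] + np.sqrt(noise_teacher) * rng.standard_normal(n)
    # Student's posterior mean prediction with its own covariance and noise level.
    K = rbf(x_train, x_train, l_student) + noise_student * np.eye(n)
    k = rbf(x_train, x_test, l_student)
    theta_hat = k.T @ np.linalg.solve(K, y)
    return np.mean((theta_hat - theta_star[n:]) ** 2)

# Crude learning curve: average the error over independent data sets.
for n in (5, 10, 20, 40, 80):
    print(n, np.mean([one_run(n) for _ in range(50)]))
```

Here the student's lengthscale is much larger than the teacher's, i.e. an over-smooth student learning a rougher teacher, which is the kind of mismatch for which the paper predicts particularly slow learning.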
26 (3) for ε to be simplified, since by exact analogy with the argument for the student posterior, ⟨θ*(x)⟩_D = k*(x)^T K*^{-1} y, ⟨θ*(x) ...⟩ ... [sent-42, score-0.579]
27 A more convenient starting point is obtained if (using Mercer's theorem) we decompose the covariance function into its eigenfunctions φ_i(x) and eigenvalues λ_i, defined w.r.t. [sent-45, score-0.297]
28 the input distribution so that ⟨C(x, x')φ_i(x')⟩_{x'} = λ_i φ_i(x), with the corresponding normalization ⟨φ_i(x)φ_j(x)⟩_x = δ_ij. [sent-48, score-0.057]
29 Then C(x, x') = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x'), and similarly for C*(x, x') with eigenvalues λ_i* and eigenfunctions φ_i*(x). For simplicity I assume here that the student and teacher covariance functions have the same eigenfunctions (but different eigenvalues). [sent-49, score-1.091]
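The eigenvalues defined this way can be estimated numerically when they are not known analytically: the eigenvalues of the kernel matrix on a sample from the input distribution, divided by the sample size, approximate the λ_i (a standard Nyström-type approximation; the code below, its names and its settings are my own illustration, not part of the paper):

```python
import numpy as np

def approx_eigenvalues(cov, x_samples, k=8):
    """Approximate the leading Mercer eigenvalues of `cov` w.r.t. the input
    distribution from which `x_samples` were drawn: eig(K) / n -> lambda_i."""
    K = cov(x_samples, x_samples)
    lam = np.linalg.eigvalsh(K)[::-1] / len(x_samples)   # descending order
    return lam[:k]

rbf = lambda a, b, l: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * l ** 2))
x = np.random.default_rng(2).uniform(-1.0, 1.0, 500)
print(approx_eigenvalues(lambda a, b: rbf(a, b, 1.0), x))   # smoother 'student' kernel
print(approx_eigenvalues(lambda a, b: rbf(a, b, 0.3), x))   # rougher 'teacher' kernel
```

The rougher kernel has a more slowly decaying eigenvalue spectrum, and it is precisely the relation between the two spectra that drives the mismatch effects analysed in the paper.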
30 This is not as restrictive as it may seem; several examples are given below. [sent-50, score-0.034]
31 The averages over the test input x in (5) are now easily carried out: ε = ... [sent-51, score-0.057]
32 for the last term we need ⟨(k(x) k(x)^T)_{lm}⟩_x = Σ_{ij} λ_i λ_j φ_i(x^l) ⟨φ_i(x)φ_j(x)⟩_x φ_j(x^m) = Σ_i λ_i² φ_i(x^l) φ_i(x^m). Introducing the diagonal eigenvalue matrix (Λ)_{ij} = λ_i δ_ij and the 'design matrix' ... [sent-53, score-0.074]
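This identity is easy to check numerically for a covariance whose eigendecomposition is known exactly. Below, a finite-rank toy covariance is built from two functions that are orthonormal under the uniform distribution on [0, 1] (the whole construction, including names and eigenvalue choices, is my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Orthonormal functions w.r.t. the uniform distribution on [0, 1]
# (first two shifted Legendre polynomials) and chosen eigenvalues.
phis = [lambda x: np.ones_like(x), lambda x: np.sqrt(3.0) * (2 * x - 1)]
lams = np.array([1.0, 0.5])

def cov(a, b):
    # Finite-rank covariance C(x, x') = sum_i lambda_i phi_i(x) phi_i(x').
    return sum(lam * p(a)[:, None] * p(b)[None, :] for lam, p in zip(lams, phis))

x_train = rng.uniform(0, 1, 4)            # the inputs x^l
x_mc = rng.uniform(0, 1, 200000)          # Monte Carlo test inputs

# Left-hand side: < k(x) k(x)^T >_x estimated by Monte Carlo, with k(x)_l = C(x^l, x).
k = cov(x_train, x_mc)                    # shape (n_train, n_mc)
lhs = k @ k.T / len(x_mc)

# Right-hand side: sum_i lambda_i^2 phi_i(x^l) phi_i(x^m).
Phi = np.stack([p(x_train) for p in phis], axis=1)
rhs = Phi @ np.diag(lams ** 2) @ Phi.T

print(np.max(np.abs(lhs - rhs)))          # small, up to Monte Carlo error
```

The Monte Carlo estimate of ⟨k(x)k(x)^T⟩ agrees with Φ Λ² Φ^T up to sampling error, which is the content of the identity above.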
wordName wordTfidf (topN-words)
[('teacher', 0.441), ('student', 0.441), ('xid', 0.289), ('gps', 0.229), ('xl', 0.202), ('id', 0.183), ('yl', 0.158), ('outputs', 0.132), ('dib', 0.115), ('plateaux', 0.115), ('behaviour', 0.101), ('gp', 0.101), ('bid', 0.1), ('mismatched', 0.1), ('posterior', 0.091), ('covariance', 0.087), ('eigenfunctions', 0.085), ('curve', 0.084), ('inputs', 0.077), ('mismatch', 0.076), ('smoothness', 0.076), ('curves', 0.076), ('prior', 0.075), ('xm', 0.073), ('lm', 0.073), ('training', 0.072), ('tk', 0.07), ('aa', 0.061), ('regression', 0.061), ('matches', 0.059), ('bar', 0.058), ('input', 0.057), ('feedforward', 0.056), ('gaussian', 0.054), ('tr', 0.054), ('learn', 0.052), ('radial', 0.051), ('eigenvalues', 0.051), ('average', 0.051), ('malzahn', 0.05), ('sollich', 0.05), ('strand', 0.05), ('aiaj', 0.05), ('corruption', 0.05), ('teachers', 0.05), ('transparent', 0.05), ('transparently', 0.05), ('woodbury', 0.05), ('yib', 0.05), ('exact', 0.047), ('dangerous', 0.046), ('king', 0.046), ('logarithmically', 0.046), ('lthe', 0.046), ('prediction', 0.046), ('assumptions', 0.046), ('generalization', 0.045), ('true', 0.045), ('giving', 0.044), ('dropping', 0.043), ('ij', 0.042), ('joint', 0.042), ('defined', 0.041), ('london', 0.04), ('infinitely', 0.04), ('match', 0.04), ('hypothesis', 0.039), ('approximations', 0.039), ('deduce', 0.038), ('email', 0.038), ('level', 0.037), ('ai', 0.037), ('functions', 0.037), ('reviews', 0.036), ('subscript', 0.036), ('achievable', 0.036), ('coincide', 0.036), ('elegant', 0.036), ('confirmed', 0.036), ('squared', 0.035), ('name', 0.035), ('opper', 0.035), ('maxima', 0.035), ('universal', 0.034), ('exception', 0.034), ('uniquely', 0.034), ('restrictive', 0.034), ('college', 0.034), ('simplified', 0.033), ('asked', 0.033), ('decompose', 0.033), ('rough', 0.033), ('worst', 0.033), ('noise', 0.032), ('last', 0.032), ('rich', 0.032), ('specifying', 0.032), ('covariances', 0.032), ('mathematics', 0.032), ('exploited', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 79 nips-2001-Gaussian Process Regression with Mismatched Models
Author: Peter Sollich
Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1
2 0.14092274 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
3 0.13368493 76 nips-2001-Fast Parameter Estimation Using Green's Functions
Author: K. Wong, F. Li
Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1
4 0.1227888 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
Author: Manfred Opper, Robert Urbanczik
Abstract: Using methods of Statistical Physics, we investigate the role of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel.
5 0.12104672 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
Author: Carl E. Rasmussen, Zoubin Ghahramani
Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.
6 0.079569861 136 nips-2001-On the Concentration of Spectral Properties
7 0.078257889 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
8 0.069090091 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
9 0.068866268 58 nips-2001-Covariance Kernels from Bayesian Generative Models
10 0.062992595 35 nips-2001-Analysis of Sparse Bayesian Learning
11 0.061835401 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
12 0.061035015 43 nips-2001-Bayesian time series classification
13 0.059010264 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
14 0.053074773 61 nips-2001-Distribution of Mutual Information
15 0.049037535 154 nips-2001-Products of Gaussians
16 0.048501447 1 nips-2001-(Not) Bounding the True Error
17 0.047039162 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
18 0.046213683 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
19 0.045341 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
20 0.045337129 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
topicId topicWeight
[(0, -0.168), (1, 0.006), (2, -0.01), (3, -0.063), (4, -0.015), (5, -0.06), (6, 0.104), (7, 0.024), (8, 0.006), (9, -0.001), (10, 0.027), (11, 0.113), (12, 0.024), (13, -0.187), (14, -0.037), (15, 0.116), (16, 0.02), (17, 0.009), (18, 0.066), (19, -0.028), (20, -0.053), (21, -0.141), (22, 0.059), (23, 0.121), (24, 0.056), (25, 0.018), (26, -0.016), (27, -0.124), (28, -0.042), (29, 0.201), (30, 0.086), (31, 0.077), (32, -0.049), (33, 0.008), (34, 0.088), (35, 0.081), (36, -0.097), (37, 0.023), (38, 0.0), (39, -0.061), (40, -0.046), (41, 0.164), (42, 0.017), (43, 0.02), (44, -0.054), (45, -0.048), (46, -0.025), (47, 0.008), (48, -0.042), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.9459002 79 nips-2001-Gaussian Process Regression with Mismatched Models
Author: Peter Sollich
Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1
2 0.77293193 76 nips-2001-Fast Parameter Estimation Using Green's Functions
Author: K. Wong, F. Li
Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1
3 0.65558112 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
4 0.55663061 35 nips-2001-Analysis of Sparse Bayesian Learning
Author: Anita C. Faul, Michael E. Tipping
Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1
5 0.54542428 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
Author: Lehel Csató, Manfred Opper, Ole Winther
Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.
6 0.49853706 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
7 0.49147919 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.45894057 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
9 0.45792103 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
10 0.42583907 154 nips-2001-Products of Gaussians
11 0.41326866 61 nips-2001-Distribution of Mutual Information
12 0.37672767 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
13 0.37642318 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
14 0.34682861 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
15 0.32955888 91 nips-2001-Improvisation and Learning
16 0.32697022 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
17 0.32168597 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
18 0.32150772 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
19 0.31119749 43 nips-2001-Bayesian time series classification
20 0.30939189 108 nips-2001-Learning Body Pose via Specialized Maps
topicId topicWeight
[(14, 0.03), (17, 0.04), (19, 0.022), (27, 0.145), (30, 0.05), (43, 0.256), (59, 0.038), (72, 0.119), (79, 0.08), (83, 0.019), (91, 0.117)]
simIndex simValue paperId paperTitle
1 0.92485774 158 nips-2001-Receptive field structure of flow detectors for heading perception
Author: J. A. Beintema, M. Lappe, Alexander C. Berg
Abstract: Observer translation relative to the world creates image flow that expands from the observer's direction of translation (heading) from which the observer can recover heading direction. Yet, the image flow is often more complex, depending on rotation of the eye, scene layout and translation velocity. A number of models [1-4] have been proposed on how the human visual system extracts heading from flow in a neurophysiologic ally plausible way. These models represent heading by a set of neurons that respond to large image flow patterns and receive input from motion sensed at different image locations. We analysed these models to determine the exact receptive field of these heading detectors. We find most models predict that, contrary to widespread believe, the contribut ing motion sensors have a preferred motion directed circularly rather than radially around the detector's preferred heading. Moreover, the results suggest to look for more refined structure within the circular flow, such as bi-circularity or local motion-opponency.
2 0.85724163 188 nips-2001-The Unified Propagation and Scaling Algorithm
Author: Yee W. Teh, Max Welling
Abstract: In this paper we will show that a restricted class of constrained minimum divergence problems, named generalized inference problems, can be solved by approximating the KL divergence with a Bethe free energy. The algorithm we derive is closely related to both loopy belief propagation and iterative scaling. This unified propagation and scaling algorithm reduces to a convergent alternative to loopy belief propagation when no constraints are present. Experiments show the viability of our algorithm.
same-paper 3 0.8536374 79 nips-2001-Gaussian Process Regression with Mismatched Models
Author: Peter Sollich
Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1
4 0.66923732 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
Author: Martin J. Wainwright, Tommi Jaakkola, Alan S. Willsky
Abstract: We develop a tree-based reparameterization framework that provides a new conceptual view of a large class of iterative algorithms for computing approximate marginals in graphs with cycles. It includes belief propagation (BP), which can be reformulated as a very local form of reparameterization. More generally, we consider algorithms that perform exact computations over spanning trees of the full graph. On the practical side, we find that such tree reparameterization (TRP) algorithms have convergence properties superior to BP. The reparameterization perspective also provides a number of theoretical insights into approximate inference, including a new characterization of fixed points; and an invariance intrinsic to TRP /BP. These two properties enable us to analyze and bound the error between the TRP /BP approximations and the actual marginals. While our results arise naturally from the TRP perspective, most of them apply in an algorithm-independent manner to any local minimum of the Bethe free energy. Our results also have natural extensions to more structured approximations [e.g. , 1, 2]. 1
5 0.66899252 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets.
6 0.66540676 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
7 0.66189218 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
8 0.66124964 13 nips-2001-A Natural Policy Gradient
9 0.65886939 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
11 0.65716559 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
12 0.65711427 61 nips-2001-Distribution of Mutual Information
13 0.65607107 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
14 0.65448999 36 nips-2001-Approximate Dynamic Programming via Linear Programming
15 0.65244877 114 nips-2001-Learning from Infinite Data in Finite Time
16 0.65177786 155 nips-2001-Quantizing Density Estimators
17 0.65085757 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
18 0.65057719 58 nips-2001-Covariance Kernels from Bayesian Generative Models
19 0.65050459 8 nips-2001-A General Greedy Approximation Algorithm with Applications
20 0.65041149 135 nips-2001-On Spectral Clustering: Analysis and an algorithm