nips nips2000 nips2000-35 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. [sent-4, score-0.285]
2 Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. [sent-5, score-0.115]
3 I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. [sent-6, score-0.602]
4 Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. [sent-7, score-0.542]
5 The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm. [sent-8, score-0.279]
6 1 Introduction Non-parametric kernel methods such as Gaussian Processes (GPs) and Support Vector Machines (SVMs) are closely related to neural networks (NNs). [sent-9, score-0.204]
7 These may be considered as single-layer networks in a possibly infinite-dimensional feature space. [sent-10, score-0.399]
8 Both the Bayesian GP approach and SVMs regularize the learning problem so that only a finite number of the features (dependent on the amount of data) is used. [sent-11, score-0.248]
9 Williams [2] has derived kernels allowing for efficient computation with both infinite feedforward and radial basis networks. [sent-13, score-0.447]
10 In this paper, I show that learning with a finite rather than an infinite network can make a profound difference by studying the case where the task to be learned is defined by a large but finite two-layer NN. [sent-14, score-0.826]
11 A theoretical analysis of the Bayesian approach to learning this task shows that the Bayesian student makes a learning transition from a linear model to a specialized non-linear one when the number of examples is of the order of the number of adjustable weights in the network. [sent-15, score-0.609]
12 This effect, which is also seen in the simulations, is a consequence of the finite complexity of the network. [sent-16, score-0.224]
13 For the GP, on the other hand, such a transition will not occur. [sent-22, score-0.129]
14 It will eventually learn the task, but it requires O(d^s) training examples to learn features of order s, where d is the input dimension. [sent-23, score-0.473]
15 However, the basic conclusions regarding learning with kernel methods and NNs turn out to be valid more generally, e. [sent-25, score-0.222]
16 I consider the usual Bayesian setup of supervised learning: a training set D_N = {(x_i, y_i) | i = 1, ..., N} (x ∈ R^d and y ∈ R) is known and the output for the new input x is predicted by the function f(x), which is sampled from the prior distribution of model outputs. [sent-28, score-0.128] [sent-31, score-0.291]
18 I will consider both a Gaussian process prior and the prior implied by a large (but finite) two-layer network. [sent-32, score-0.286]
19 The output noise is taken to be Gaussian, so the likelihood becomes p(y|f(x)) = exp(-(y - f(x))^2 / (2σ^2)) / √(2πσ^2). [sent-33, score-0.08]
20 In this case, the model output prior is by definition Gaussian, f ~ N(0, C) (Eq. 2), where C is the covariance matrix. [sent-36, score-0.263]
21 The covariance matrix is computed from the kernel (covariance function) C(x, x'). [sent-37, score-0.176]
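The following minimal Python sketch illustrates the GP regression setup described above: a covariance matrix is built from a kernel C(x, x'), and the posterior mean and variance at new inputs follow from the Gaussian prior f ~ N(0, C) and the Gaussian likelihood. The RBF kernel, the noise level sigma2 and the toy data are placeholder assumptions, not the kernel or hyperparameters used in the paper.

# Minimal GP regression sketch (illustration only, not the paper's exact setup).
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    # Placeholder covariance function C(x, x'); the paper instead uses the
    # kernel induced by an infinite two-layer network.
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * sq / length_scale**2)

def gp_predict(X, y, X_star, kernel, sigma2=0.1):
    # Posterior mean and variance of f(x*) under a GP prior f ~ N(0, C) and
    # Gaussian likelihood p(y | f(x)) = N(f(x), sigma2).
    C = kernel(X, X)                      # covariance matrix on the training inputs
    k_star = kernel(X, X_star)            # cross-covariances to the test inputs
    A = np.linalg.inv(C + sigma2 * np.eye(len(X)))
    mean = k_star.T @ A @ y
    var = np.diag(kernel(X_star, X_star)) - np.sum(k_star * (A @ k_star), axis=0)
    return mean, var

# Toy usage with synthetic data in d = 3 input dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
mean, var = gp_predict(X, y, rng.normal(size=(5, 3)), rbf_kernel)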
22 Below I give an explicit example corresponding to an infinite two-layer network. [sent-38, score-0.293]
23 Bayesian neural networks The output of the two-layer NN is given by f(x, w, W) = Σ_k w_k Φ(W_k · x)/√K, where an especially convenient choice of transfer function in what follows is Φ(z) = ∫_{-∞}^{z} dt e^{-t²/2}/√(2π). [sent-39, score-0.152] [sent-40, score-0.163]
25 I consider a Bayesian framework (with fixed known hyperparameters) with a weight prior that factorizes over hidden units, p(w, W) = Π_k [p(w_k) p(W_k)], and Gaussian input-to-hidden weights W_k ~ N(0, Σ). [sent-41, score-0.379]
26 The prior over outputs for the Bayesian neural network is p(f) = ∫ dw dW p(w, W) Π_i δ(f(x_i) - f(x_i; w, W)). [sent-43, score-0.146]
27 In the infinite hidden unit limit, K → ∞, when p(w_k) has zero mean and finite (say unit) variance, it follows from the central limit theorem (CLT) that the prior distribution converges to a Gaussian process f ~ N(0, C) with kernel [1, 2] C(x, x') = ∫ dw p(w) Φ(w · x) Φ(w · x'). [sent-44, score-0.934]
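The infinite hidden unit limit can be checked numerically. The Python sketch below samples finite two-layer networks with the reconstructed transfer function (the standard Gaussian CDF) and compares their empirical output covariance at two inputs to a Monte Carlo estimate of the kernel C(x, x') = ∫ dw p(w) Φ(w · x) Φ(w · x'). The input-to-hidden covariance Σ = I/d, the width K and the sample counts are illustrative assumptions, not values from the paper.

# Numerical check of the infinite hidden unit limit (assumptions: Sigma = I/d,
# K = 200; Phi is the Gaussian CDF, matching the integral given above).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, K, n_nets = 4, 200, 5000
Sigma = np.eye(d) / d                          # assumed input-to-hidden covariance
x1, x2 = rng.normal(size=d), rng.normal(size=d)

def phi(z):
    # Phi(z) = integral_{-inf}^{z} dt exp(-t^2 / 2) / sqrt(2 pi)
    return norm.cdf(z)

# Empirical prior covariance of finite-network outputs f = sum_k w_k Phi(W_k . x) / sqrt(K).
W = rng.multivariate_normal(np.zeros(d), Sigma, size=(n_nets, K))
w = rng.normal(size=(n_nets, K))               # zero-mean, unit-variance output weights
f1 = (w * phi(W @ x1)).sum(axis=1) / np.sqrt(K)
f2 = (w * phi(W @ x2)).sum(axis=1) / np.sqrt(K)
print("finite-K output covariance:", np.mean(f1 * f2))

# Monte Carlo estimate of the infinite-network kernel C(x1, x2).
W_mc = rng.multivariate_normal(np.zeros(d), Sigma, size=200000)
print("kernel estimate:           ", np.mean(phi(W_mc @ x1) * phi(W_mc @ x2)))

The two printed numbers agree up to sampling noise for any K; the additional content of the CLT argument is that the joint distribution of the outputs also becomes Gaussian as K → ∞.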
wordName wordTfidf (topN-words)
[('wk', 0.37), ('nns', 0.333), ('gp', 0.313), ('infinite', 0.293), ('nn', 0.274), ('bayesian', 0.255), ('gps', 0.185), ('lund', 0.147), ('finite', 0.145), ('limit', 0.126), ('cj', 0.116), ('prior', 0.106), ('ef', 0.104), ('kernel', 0.099), ('adjustable', 0.096), ('gaussian', 0.091), ('learn', 0.086), ('svms', 0.081), ('output', 0.08), ('covariance', 0.077), ('units', 0.077), ('aff', 0.074), ('factorizes', 0.074), ('implied', 0.074), ('ole', 0.074), ('regressor', 0.074), ('solvegatan', 0.074), ('unpublished', 0.074), ('winther', 0.074), ('ylf', 0.074), ('networks', 0.072), ('hidden', 0.071), ('teacher', 0.067), ('student', 0.067), ('theoretical', 0.066), ('task', 0.065), ('specialized', 0.062), ('dw', 0.062), ('jk', 0.062), ('eventually', 0.058), ('hyperparameters', 0.058), ('argued', 0.058), ('studying', 0.058), ('sweden', 0.058), ('processes', 0.056), ('neal', 0.055), ('minus', 0.055), ('features', 0.055), ('transition', 0.055), ('besides', 0.052), ('transfer', 0.052), ('dn', 0.052), ('entirely', 0.052), ('williams', 0.052), ('examples', 0.051), ('weights', 0.051), ('confirmed', 0.05), ('ok', 0.05), ('radial', 0.048), ('setup', 0.048), ('learning', 0.048), ('rd', 0.046), ('mechanics', 0.046), ('convenient', 0.046), ('feedforward', 0.044), ('regarding', 0.044), ('lu', 0.044), ('findings', 0.043), ('reasons', 0.042), ('usual', 0.042), ('complexity', 0.041), ('unit', 0.04), ('network', 0.04), ('input', 0.04), ('consequence', 0.038), ('supervised', 0.038), ('th', 0.037), ('analytical', 0.037), ('believe', 0.036), ('mean', 0.035), ('considered', 0.034), ('predicted', 0.034), ('closely', 0.033), ('especially', 0.033), ('physics', 0.033), ('xi', 0.033), ('reason', 0.033), ('calculate', 0.033), ('requires', 0.032), ('follows', 0.032), ('kernels', 0.032), ('sampled', 0.031), ('valid', 0.031), ('error', 0.031), ('average', 0.031), ('allowing', 0.03), ('say', 0.03), ('curves', 0.03), ('applying', 0.03), ('converge', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
2 0.2718972 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
3 0.15522847 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models.
4 0.14739853 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p-1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 - α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.
5 0.1451546 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.
6 0.1302284 92 nips-2000-Occam's Razor
7 0.12596858 75 nips-2000-Large Scale Bayes Point Machines
8 0.11004305 133 nips-2000-The Kernel Gibbs Sampler
9 0.10844705 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
10 0.10180818 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
11 0.10065592 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
12 0.093663119 121 nips-2000-Sparse Kernel Principal Component Analysis
13 0.080729231 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
14 0.07599631 134 nips-2000-The Kernel Trick for Distances
15 0.074776888 110 nips-2000-Regularization with Dot-Product Kernels
16 0.074151769 27 nips-2000-Automatic Choice of Dimensionality for PCA
17 0.0734929 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
18 0.072477303 85 nips-2000-Mixtures of Gaussian Processes
19 0.072315753 54 nips-2000-Feature Selection for SVMs
20 0.071967512 130 nips-2000-Text Classification using String Kernels
topicId topicWeight
[(0, 0.244), (1, 0.101), (2, 0.033), (3, -0.032), (4, 0.146), (5, 0.144), (6, -0.081), (7, 0.101), (8, 0.037), (9, -0.313), (10, 0.187), (11, 0.035), (12, 0.039), (13, -0.117), (14, 0.119), (15, 0.147), (16, -0.188), (17, 0.062), (18, 0.09), (19, -0.046), (20, 0.066), (21, -0.176), (22, 0.138), (23, 0.143), (24, 0.022), (25, 0.007), (26, -0.053), (27, -0.159), (28, -0.0), (29, 0.021), (30, -0.032), (31, -0.096), (32, 0.043), (33, -0.027), (34, 0.022), (35, 0.011), (36, 0.04), (37, -0.14), (38, -0.016), (39, -0.083), (40, -0.016), (41, -0.003), (42, 0.015), (43, -0.04), (44, -0.062), (45, -0.033), (46, -0.038), (47, -0.066), (48, -0.101), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95992279 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
2 0.82599211 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
3 0.56418568 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models.
4 0.47956365 85 nips-2000-Mixtures of Gaussian Processes
Author: Volker Tresp
Abstract: We introduce the mixture of Gaussian processes (MGP) model, which is useful for applications in which the optimal bandwidth of a map is input dependent. The MGP is derived from the mixture of experts model and can also be used for modeling general conditional probability densities. We discuss how Gaussian processes, in particular in the form of Gaussian process classification, the support vector machine and the MGP model, can be used for quantifying the dependencies in graphical models.
5 0.45595908 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
6 0.44716388 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
7 0.42435896 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
8 0.38268709 133 nips-2000-The Kernel Gibbs Sampler
9 0.37725922 120 nips-2000-Sparse Greedy Gaussian Process Regression
10 0.37023476 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
11 0.36754426 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
12 0.35846281 75 nips-2000-Large Scale Bayes Point Machines
13 0.35743585 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
14 0.35156715 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
15 0.31067982 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
16 0.30921769 27 nips-2000-Automatic Choice of Dimensionality for PCA
17 0.29089695 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
18 0.28447932 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
19 0.27658021 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
20 0.25912619 54 nips-2000-Feature Selection for SVMs
topicId topicWeight
[(10, 0.029), (17, 0.116), (33, 0.024), (54, 0.455), (55, 0.038), (62, 0.024), (67, 0.091), (76, 0.094), (79, 0.012), (90, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.88036621 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
2 0.83383787 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
Author: Martin J. Wainwright, Erik B. Sudderth, Alan S. Willsky
Abstract: We present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity.
3 0.64686579 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
Author: Shimon Edelman, Nathan Intrator
Abstract: We describe a unified framework for the understanding of structure representation in primate vision. A model derived from this framework is shown to be effectively systematic in that it has the ability to interpret and associate together objects that are related through a rearrangement of common
4 0.54782075 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
5 0.41379291 60 nips-2000-Gaussianization
Author: Scott Saobing Chen, Ramesh A. Gopinath
Abstract: High dimensional data modeling is difficult mainly because the so-called
7 0.36817181 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
8 0.36538312 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
9 0.36444035 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
10 0.36331213 92 nips-2000-Occam's Razor
11 0.35411367 23 nips-2000-An Adaptive Metric Machine for Pattern Classification
12 0.35367239 85 nips-2000-Mixtures of Gaussian Processes
13 0.35239357 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
14 0.35150319 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
15 0.35140908 134 nips-2000-The Kernel Trick for Distances
16 0.35098535 21 nips-2000-Algorithmic Stability and Generalization Performance
17 0.3482585 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks
18 0.34560016 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
19 0.34439629 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
20 0.34325904 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling