nips nips2000 nips2000-35 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. [sent-4, score-0.285]
2 Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. [sent-5, score-0.115]
3 I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. [sent-6, score-0.602]
4 Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. [sent-7, score-0.542]
5 The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm. [sent-8, score-0.279]
6 1 Introduction Non-parametric kernel methods such as Gaussian Processes (GPs) and Support Vector Machines (SVMs) are closely related to neural networks (NNs). [sent-9, score-0.204]
7 These may be considered as single-layer networks in a possibly infinite-dimensional feature space. [sent-10, score-0.399]
8 Both the Bayesian GP approach and SVMs regularize the learning problem so that only a finite number of the features (dependent on the amount of data) is used. [sent-11, score-0.248]
9 Williams [2] has derived kernels allowing for efficient computation with both infinite feedforward and radial basis networks. [sent-13, score-0.447]
10 In this paper, I show that learning with a finite rather than an infinite network can make a profound difference by studying the case where the task to be learned is defined by a large but finite two-layer NN. [sent-14, score-0.826]
11 A theoretical analysis of the Bayesian approach to learning this task shows that the Bayesian student makes a learning transition from a linear model to a specialized non-linear one when the number of examples is of the order of the number of adjustable weights in the network. [sent-15, score-0.609]
12 This effect, which is also seen in the simulations, is a consequence of the finite complexity of the network. [sent-16, score-0.224]
13 For the GP, on the other hand, such a transition will not occur. [sent-22, score-0.129]
14 It will eventually learn the task, but it requires O(d^s) training examples to learn features of order s, where d is the input dimension. [sent-23, score-0.473]
15 However, the basic conclusions regarding learning with kernel methods and NNs turn out to be valid more generally, e. [sent-25, score-0.222]
16 I consider the usual Bayesian setup of supervised learning: a training set D_N = {(x_i, y_i) | i = 1, ..., N} (x ∈ R^d and y ∈ R) is known and the output for the new input x is predicted by the function f(x), which is sampled from the prior distribution of model outputs. [sent-28, score-0.128] [sent-31, score-0.291]
18 I will consider both a Gaussian process prior and the prior implied by a large (but finite) two-layer network. [sent-32, score-0.286]
19 The output noise is taken to be Gaussian, so the likelihood becomes p(y|f(x)) = exp(-(y - f(x))^2 / (2σ^2)) / √(2πσ^2). [sent-33, score-0.08]
20 In this case, the model output prior is by definition Gaussian, f ~ N(0, C) (Eq. 2), where C is the covariance matrix. [sent-36, score-0.263]
21 The covariance matrix is computed from the kernel (covariance function) C(x, x'). [sent-37, score-0.176]
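The following minimal Python sketch illustrates the GP regression setup described above: a covariance matrix is built from a kernel C(x, x'), and the posterior mean and variance at new inputs follow from the Gaussian prior f ~ N(0, C) and the Gaussian likelihood. The RBF kernel, the noise level sigma2 and the toy data are placeholder assumptions, not the kernel or hyperparameters used in the paper.

# Minimal GP regression sketch (illustration only, not the paper's exact setup).
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    # Placeholder covariance function C(x, x'); the paper instead uses the
    # kernel induced by an infinite two-layer network.
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * sq / length_scale**2)

def gp_predict(X, y, X_star, kernel, sigma2=0.1):
    # Posterior mean and variance of f(x*) under a GP prior f ~ N(0, C) and
    # Gaussian likelihood p(y | f(x)) = N(f(x), sigma2).
    C = kernel(X, X)                      # covariance matrix on the training inputs
    k_star = kernel(X, X_star)            # cross-covariances to the test inputs
    A = np.linalg.inv(C + sigma2 * np.eye(len(X)))
    mean = k_star.T @ A @ y
    var = np.diag(kernel(X_star, X_star)) - np.sum(k_star * (A @ k_star), axis=0)
    return mean, var

# Toy usage with synthetic data in d = 3 input dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
mean, var = gp_predict(X, y, rng.normal(size=(5, 3)), rbf_kernel)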
22 Below I give an explicit example corresponding to an infinite two-layer network. [sent-38, score-0.293]
23 Bayesian neural networks The output of the two-layer NN is given by f(x, w, W) = Σ_k w_k Φ(W_k · x)/√K, where an especially convenient choice of transfer function in what follows is Φ(z) = ∫_{-∞}^{z} dt e^{-t²/2}/√(2π). [sent-39, score-0.152] [sent-40, score-0.163]
25 I consider a Bayesian framework (with fixed known hyperparameters) with a weight prior that factorizes over hidden units, p(w, W) = Π_k [p(w_k) p(W_k)], and Gaussian input-to-hidden weights W_k ~ N(0, Σ). [sent-41, score-0.379]
26 The prior over outputs for the Bayesian neural network is p(f) = ∫ dw dW p(w, W) Π_i δ(f(x_i) - f(x_i; w, W)). [sent-43, score-0.146]
27 In the infinite hidden unit limit, K → ∞, when p(w_k) has zero mean and finite (say unit) variance, it follows from the central limit theorem (CLT) that the prior distribution converges to a Gaussian process f ~ N(0, C) with kernel [1, 2] C(x, x') = ∫ dw p(w) Φ(w · x) Φ(w · x'). [sent-44, score-0.934]
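The infinite hidden unit limit can be checked numerically. The Python sketch below samples finite two-layer networks with the reconstructed transfer function (the standard Gaussian CDF) and compares their empirical output covariance at two inputs to a Monte Carlo estimate of the kernel C(x, x') = ∫ dw p(w) Φ(w · x) Φ(w · x'). The input-to-hidden covariance Σ = I/d, the width K and the sample counts are illustrative assumptions, not values from the paper.

# Numerical check of the infinite hidden unit limit (assumptions: Sigma = I/d,
# K = 200; Phi is the Gaussian CDF, matching the integral given above).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, K, n_nets = 4, 200, 5000
Sigma = np.eye(d) / d                          # assumed input-to-hidden covariance
x1, x2 = rng.normal(size=d), rng.normal(size=d)

def phi(z):
    # Phi(z) = integral_{-inf}^{z} dt exp(-t^2 / 2) / sqrt(2 pi)
    return norm.cdf(z)

# Empirical prior covariance of finite-network outputs f = sum_k w_k Phi(W_k . x) / sqrt(K).
W = rng.multivariate_normal(np.zeros(d), Sigma, size=(n_nets, K))
w = rng.normal(size=(n_nets, K))               # zero-mean, unit-variance output weights
f1 = (w * phi(W @ x1)).sum(axis=1) / np.sqrt(K)
f2 = (w * phi(W @ x2)).sum(axis=1) / np.sqrt(K)
print("finite-K output covariance:", np.mean(f1 * f2))

# Monte Carlo estimate of the infinite-network kernel C(x1, x2).
W_mc = rng.multivariate_normal(np.zeros(d), Sigma, size=200000)
print("kernel estimate:           ", np.mean(phi(W_mc @ x1) * phi(W_mc @ x2)))

The two printed numbers agree up to sampling noise for any K; the additional content of the CLT argument is that the joint distribution of the outputs also becomes Gaussian as K → ∞.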
wordName wordTfidf (topN-words)
[('wk', 0.37), ('nns', 0.333), ('gp', 0.313), ('infinite', 0.293), ('nn', 0.274), ('bayesian', 0.255), ('gps', 0.185), ('lund', 0.147), ('finite', 0.145), ('limit', 0.126), ('cj', 0.116), ('prior', 0.106), ('ef', 0.104), ('kernel', 0.099), ('adjustable', 0.096), ('gaussian', 0.091), ('learn', 0.086), ('svms', 0.081), ('output', 0.08), ('covariance', 0.077), ('units', 0.077), ('aff', 0.074), ('factorizes', 0.074), ('implied', 0.074), ('ole', 0.074), ('regressor', 0.074), ('solvegatan', 0.074), ('unpublished', 0.074), ('winther', 0.074), ('ylf', 0.074), ('networks', 0.072), ('hidden', 0.071), ('teacher', 0.067), ('student', 0.067), ('theoretical', 0.066), ('task', 0.065), ('specialized', 0.062), ('dw', 0.062), ('jk', 0.062), ('eventually', 0.058), ('hyperparameters', 0.058), ('argued', 0.058), ('studying', 0.058), ('sweden', 0.058), ('processes', 0.056), ('neal', 0.055), ('minus', 0.055), ('features', 0.055), ('transition', 0.055), ('besides', 0.052), ('transfer', 0.052), ('dn', 0.052), ('entirely', 0.052), ('williams', 0.052), ('examples', 0.051), ('weights', 0.051), ('confirmed', 0.05), ('ok', 0.05), ('radial', 0.048), ('setup', 0.048), ('learning', 0.048), ('rd', 0.046), ('mechanics', 0.046), ('convenient', 0.046), ('feedforward', 0.044), ('regarding', 0.044), ('lu', 0.044), ('findings', 0.043), ('reasons', 0.042), ('usual', 0.042), ('complexity', 0.041), ('unit', 0.04), ('network', 0.04), ('input', 0.04), ('consequence', 0.038), ('supervised', 0.038), ('th', 0.037), ('analytical', 0.037), ('believe', 0.036), ('mean', 0.035), ('considered', 0.034), ('predicted', 0.034), ('closely', 0.033), ('especially', 0.033), ('physics', 0.033), ('xi', 0.033), ('reason', 0.033), ('calculate', 0.033), ('requires', 0.032), ('follows', 0.032), ('kernels', 0.032), ('sampled', 0.031), ('valid', 0.031), ('error', 0.031), ('average', 0.031), ('allowing', 0.03), ('say', 0.03), ('curves', 0.03), ('applying', 0.03), ('converge', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
2 0.2718972 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
3 0.15522847 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models.
4 0.14739853 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p-1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 - α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.
5 0.1451546 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.
6 0.1302284 92 nips-2000-Occam's Razor
7 0.12596858 75 nips-2000-Large Scale Bayes Point Machines
8 0.11004305 133 nips-2000-The Kernel Gibbs Sampler
9 0.10844705 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
10 0.10180818 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
11 0.10065592 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
12 0.093663119 121 nips-2000-Sparse Kernel Principal Component Analysis
13 0.080729231 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
14 0.07599631 134 nips-2000-The Kernel Trick for Distances
15 0.074776888 110 nips-2000-Regularization with Dot-Product Kernels
16 0.074151769 27 nips-2000-Automatic Choice of Dimensionality for PCA
17 0.0734929 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
18 0.072477303 85 nips-2000-Mixtures of Gaussian Processes
19 0.072315753 54 nips-2000-Feature Selection for SVMs
20 0.071967512 130 nips-2000-Text Classification using String Kernels
topicId topicWeight
[(0, 0.244), (1, 0.101), (2, 0.033), (3, -0.032), (4, 0.146), (5, 0.144), (6, -0.081), (7, 0.101), (8, 0.037), (9, -0.313), (10, 0.187), (11, 0.035), (12, 0.039), (13, -0.117), (14, 0.119), (15, 0.147), (16, -0.188), (17, 0.062), (18, 0.09), (19, -0.046), (20, 0.066), (21, -0.176), (22, 0.138), (23, 0.143), (24, 0.022), (25, 0.007), (26, -0.053), (27, -0.159), (28, -0.0), (29, 0.021), (30, -0.032), (31, -0.096), (32, 0.043), (33, -0.027), (34, 0.022), (35, 0.011), (36, 0.04), (37, -0.14), (38, -0.016), (39, -0.083), (40, -0.016), (41, -0.003), (42, 0.015), (43, -0.04), (44, -0.062), (45, -0.033), (46, -0.038), (47, -0.066), (48, -0.101), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95992279 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
2 0.82599211 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
3 0.56418568 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models.
4 0.47956365 85 nips-2000-Mixtures of Gaussian Processes
Author: Volker Tresp
Abstract: We introduce the mixture of Gaussian processes (MGP) model, which is useful for applications in which the optimal bandwidth of a map is input dependent. The MGP is derived from the mixture of experts model and can also be used for modeling general conditional probability densities. We discuss how Gaussian processes, in particular in the form of Gaussian process classification, the support vector machine and the MGP model, can be used for quantifying the dependencies in graphical models.
5 0.45595908 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
6 0.44716388 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
7 0.42435896 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
8 0.38268709 133 nips-2000-The Kernel Gibbs Sampler
9 0.37725922 120 nips-2000-Sparse Greedy Gaussian Process Regression
10 0.37023476 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
11 0.36754426 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
12 0.35846281 75 nips-2000-Large Scale Bayes Point Machines
13 0.35743585 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
14 0.35156715 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
15 0.31067982 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
16 0.30921769 27 nips-2000-Automatic Choice of Dimensionality for PCA
17 0.29089695 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
18 0.28447932 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
19 0.27658021 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
20 0.25912619 54 nips-2000-Feature Selection for SVMs
topicId topicWeight
[(10, 0.029), (17, 0.116), (33, 0.024), (54, 0.455), (55, 0.038), (62, 0.024), (67, 0.091), (76, 0.094), (79, 0.012), (90, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.88036621 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
2 0.83383787 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
Author: Martin J. Wainwright, Erik B. Sudderth, Alan S. Willsky
Abstract: We present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity.
3 0.64686579 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
Author: Shimon Edelman, Nathan Intrator
Abstract: We describe a unified framework for the understanding of structure representation in primate vision. A model derived from this framework is shown to be effectively systematic in that it has the ability to interpret and associate together objects that are related through a rearrangement of common
4 0.54782075 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
5 0.41379291 60 nips-2000-Gaussianization
Author: Scott Saobing Chen, Ramesh A. Gopinath
Abstract: High dimensional data modeling is difficult mainly because the so-called
7 0.36817181 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
8 0.36538312 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
9 0.36444035 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
10 0.36331213 92 nips-2000-Occam's Razor
11 0.35411367 23 nips-2000-An Adaptive Metric Machine for Pattern Classification
12 0.35367239 85 nips-2000-Mixtures of Gaussian Processes
13 0.35239357 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
14 0.35150319 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
15 0.35140908 134 nips-2000-The Kernel Trick for Distances
16 0.35098535 21 nips-2000-Algorithmic Stability and Generalization Performance
17 0.3482585 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks
18 0.34560016 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
19 0.34439629 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
20 0.34325904 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling