
92 nips-2000-Occam's Razor


Source: pdf

Author: Carl Edward Rasmussen, Zoubin Ghahramani

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, England, zoubin@gatsby.ucl.ac.uk [sent-6, score-0.072]

2 Abstract The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. [sent-13, score-0.319]

3 The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. [sent-15, score-0.332]

4 We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work. [sent-16, score-0.268]

5 1 Introduction Occam's Razor is a well-known principle of "parsimony of explanations" which is influential in scientific thinking in general and in problems of statistical inference in particular. [sent-17, score-0.128]

6 One might think that one has to build a prior over models which explicitly favours simpler models. [sent-19, score-0.315]

7 But as we will see, Occam's Razor is in fact embodied in the application of Bayesian theory. [sent-20, score-0.046]

8 We focus on complex models with large numbers of parameters, which are often referred to as non-parametric. [sent-22, score-0.361]

9 We will use the term to refer to models in which we do not necessarily know the roles played by individual parameters, and inference is not primarily targeted at the parameters themselves, but rather at the predictions made by the models. [sent-23, score-0.487]

10 These types of models are typical for applications in machine learning. [sent-24, score-0.17]

11 From a non-Bayesian perspective, arguments are put forward for adjusting model complexity in the light of limited training data, to avoid over-fitting. [sent-25, score-0.326]

12 Model complexity is often regulated by adjusting the number of free parameters in the model, and sometimes complexity is further constrained by the use of regularizers (such as weight decay). [sent-26, score-0.537]

13 If the model complexity is either too low or too high, performance on an independent test set will suffer, giving rise to a characteristic Occam's Hill. [sent-27, score-0.24]

14 Typically an estimator of the generalization error or an independent validation set is used to control the model complexity. [sent-28, score-0.083]
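
A hedged illustration (mine, not from the paper): the Python sketch below traces this trade-off by sweeping polynomial degree on toy data and scoring each fit on an independent validation set; the sine target, noise level, and degree range are all assumptions. Validation error is typically U-shaped in complexity, the mirror image of an Occam's Hill in performance.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr = np.sort(rng.uniform(-np.pi, np.pi, 20))
y_tr = np.sin(x_tr) + 0.2 * rng.normal(size=20)    # toy training data
x_va = np.linspace(-np.pi, np.pi, 200)
y_va = np.sin(x_va) + 0.2 * rng.normal(size=200)   # independent validation set

for degree in range(13):                           # sweep model complexity
    coef = np.polyfit(x_tr, y_tr, degree)
    mse = np.mean((np.polyval(coef, x_va) - y_va) ** 2)
    print(f"degree {degree:2d}  validation MSE {mse:.3f}")
# error typically falls, bottoms out at an intermediate degree, then rises again
```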

15 From the Bayesian perspective, authors seem to take two conflicting stands on the question of model complexity. [sent-29, score-0.175]

16 One view is to infer the probability of the model for each of several different model sizes and use these probabilities when making predictions. [sent-30, score-0.243]

17 An alternate view suggests that we simply choose a "large enough" model and sidestep the problem of model size selection. [sent-31, score-0.243]

18 Note that both views assume that parameters are averaged over. [sent-32, score-0.18]

19 Example: Should we use Occam's Razor to determine the optimal number of hidden units in a neural network or should we simply use as many hidden units as possible computationally? [sent-33, score-0.204]

20 1 View 1: Model size selection One of the central quantities in Bayesian learning is the evidence, the probability of the data given the model, P(Y|M_i), computed as the integral over the parameters w of the likelihood times the prior. [sent-36, score-0.319]

21 The evidence is related to the probability of the model, P(M_i|Y), through Bayes' rule: P(M_i|Y) = P(Y|M_i) P(M_i) / P(Y), where it is not uncommon that the prior on models P(M_i) is flat, such that P(M_i|Y) is proportional to the evidence. [sent-37, score-0.505]
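
As a small sketch of this step (the function name is hypothetical, not from the paper): with a flat prior P(M_i), the model posterior is just a numerically stable softmax over the log evidences.

```python
import numpy as np

def model_posterior(log_evidences):
    """P(M_i|Y) under a flat prior P(M_i): softmax of log P(Y|M_i)."""
    le = np.asarray(log_evidences, dtype=float)
    p = np.exp(le - le.max())          # subtract the max for numerical stability
    return p / p.sum()

print(model_posterior([-104.2, -101.7, -103.9]))   # favours the middle model
```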

22 Figure 1 explains why the evidence discourages overcomplex models, and can be used to select the most probable model. [sent-38, score-0.5]

23 It is also possible to understand how the evidence discourages overcomplex models and therefore embodies Occam's Razor by using the following interpretation. [sent-39, score-0.646]

24 The evidence is the probability that if you randomly selected parameter values from your model class, you would generate data set Y. [sent-40, score-0.357]

25 Models that are too simple will be very unlikely to generate that particular data set, whereas models that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random. [sent-41, score-0.557]
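
That interpretation maps directly onto a simple Monte Carlo estimator of the evidence: draw parameters from the prior and average the probability they assign to the observed data. A minimal sketch, assuming hypothetical helpers `sample_prior(rng, n)` and `log_likelihood(w)`; note that plain prior sampling is inefficient when the likelihood is sharply peaked.

```python
import numpy as np

def log_evidence_mc(log_likelihood, sample_prior, n=100_000, seed=0):
    """Estimate log P(Y|M) = log E_{w ~ P(w|M)}[P(Y|w, M)] by prior sampling."""
    rng = np.random.default_rng(seed)
    ws = sample_prior(rng, n)                        # draws from P(w|M)
    ll = np.array([log_likelihood(w) for w in ws])   # log P(Y|w, M) per draw
    m = ll.max()                                     # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(ll - m)))
```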

26 2 View 2: Large models In non-parametric Bayesian models there is no statistical reason to constrain models, as long as our prior reflects our beliefs. [sent-43, score-0.434]

27 Therefore, we should consider models with as many parameters as we can handle computationally. [sent-47, score-0.255]

28 Neal [1996] showed how multilayer perceptrons with large numbers of hidden units achieved good performance on small data sets. [sent-48, score-0.169]

29 He used sophisticated MCMC techniques to implement averaging over parameters. [sent-49, score-0.082]

30 Following this line of thought, there is no model complexity selection task: We don't need to evaluate evidence (which is often difficult) and we don't need or want to use Occam's Razor to limit the number of parameters in our model. [sent-50, score-0.697]

31 We really ought to average together predictions from all models weighted by their probabilities. [sent-51, score-0.408]

32 However, if the evidence is strongly peaked, or for practical reasons, we may want to select one as an approximation. [sent-52, score-0.303]

33 For some models, the limit of an infinite number of parameters is a simple model which can be treated tractably. [sent-53, score-0.396]

34 Two examples are the Gaussian Process limit of Bayesian neural networks [Neal, 1996], and the infinite limit of Gaussian mixture models [Rasmussen, 2000]. [sent-54, score-0.479]

35 Figure 1: Left panel: the evidence as a function of an abstract one-dimensional representation of "all possible" datasets. [sent-55, score-0.239]

36 Because the evidence must "normalize", very complex models which can account for many data sets only achieve modest evidence; simple models can reach high evidences, but only for a limited set of data. [sent-56, score-0.619]

37 When a dataset Y is observed, the evidence can be used to select between model complexities. [sent-57, score-0.347]

38 Such selection cannot be done using just the likelihood, P(Y|w, M_i). [sent-58, score-0.071]

39 Right panel: neural networks with different numbers of hidden units form a family of models, posing the model selection problem. [sent-59, score-0.27]

40 For example, a Fourier model for scalar inputs has the form y(x) = a_0 + Σ_{d=1}^{D} [a_d sin(dx) + b_d cos(dx)], with weights w = {a_0, a_1, b_1, ..., a_D, b_D}. [sent-61, score-0.212]

41 Assuming an independent Gaussian prior on the weights, p(w|S, c) ∝ exp(-(S/2)[c_0 a_0² + Σ_{d=1}^{D} c_d (a_d² + b_d²)]), where S is an overall scale and the c_d are precisions (inverse variances) for weights of order (frequency) d. [sent-64, score-0.367]
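
This model is straightforward to realize in code. A minimal sketch (variable names are mine; it assumes the "unscaled" setting c_d = 1 and S = 1 used in Figure 2) that draws one random function from this prior:

```python
import numpy as np

def fourier_features(x, D):
    """Basis phi(x) = [1, sin(x), cos(x), ..., sin(Dx), cos(Dx)]."""
    cols = [np.ones_like(x)]
    for d in range(1, D + 1):
        cols += [np.sin(d * x), np.cos(d * x)]
    return np.stack(cols, axis=-1)                 # shape (len(x), 2D+1)

D, S = 6, 1.0
c = np.ones(D + 1)                                 # "unscaled" model: c_d = 1
w_var = np.repeat(1.0 / (S * c), [1] + [2] * D)    # var(a_d) = var(b_d) = 1/(S c_d)
rng = np.random.default_rng(0)
w = rng.normal(scale=np.sqrt(w_var))               # w ~ N(0, diag(1/(S c)))
x = np.linspace(-np.pi, np.pi, 200)
y = fourier_features(x, D) @ w                     # one function drawn from the prior
```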

42 It is easy to show that Gaussian priors over weights imply Gaussian Process priors over functions. [sent-65, score-0.22]

43 The covariance function for the corresponding Gaussian Process prior is K(x, x') = [Σ_{d=0}^{D} cos(d(x - x'))/c_d]/S, where the d = 0 term carries the constant contribution of a_0. [sent-66, score-0.094]
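
The weight-space and function-space views can be checked against each other numerically: the covariance of y(x) = wᵀφ(x) under the Gaussian weight prior must equal the kernel above. A self-contained sketch under the same assumed c_d = 1, S = 1 setting:

```python
import numpy as np

D, S = 6, 1.0
c = np.ones(D + 1)                                    # precisions c_0, ..., c_D

def phi(x):  # basis [1, sin(x), cos(x), ..., sin(Dx), cos(Dx)]
    cols = [np.ones_like(x)]
    for d in range(1, D + 1):
        cols += [np.sin(d * x), np.cos(d * x)]
    return np.stack(cols, axis=-1)

x = np.linspace(-np.pi, np.pi, 50)
w_var = np.repeat(1.0 / (S * c), [1] + [2] * D)       # prior weight variances
K_weight = phi(x) @ np.diag(w_var) @ phi(x).T         # covariance of y = phi(x)^T w
diff = np.subtract.outer(x, x)
K_kernel = sum(np.cos(d * diff) / c[d] for d in range(D + 1)) / S
assert np.allclose(K_weight, K_kernel)                # identical GP covariance
```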

44 Figure 2: Top: 12 different model orders for the "unscaled" model: c_d ∝ 1. [sent-102, score-0.294]

45 The mean predictions are shown with a full line; the dashed and dotted lines delimit the 50% and 95% central mass of the predictive distribution (which is Student-t). [sent-103, score-0.221]

46 The probabilities of the models exhibit an Occam's Hill, discouraging models that are either "too small" or "too big". [sent-105, score-0.34]

47 1 Inference in the Fourier model Given data D = {x^(n), y^(n) | n = 1, ..., N} with independent Gaussian noise with precision τ, the likelihood is p(y|x, w, τ) ∝ ∏_{n=1}^{N} exp(-(τ/2)[y^(n) - wᵀφ^(n)]²), where φ^(n) denotes the basis functions evaluated at x^(n). [sent-107, score-0.083] [sent-110, score-0.22]
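
Because the model is linear in w with Gaussian prior and noise, the evidence of View 1 needs no approximation here: marginalizing out w leaves y ~ N(0, Φ diag(w_var) Φᵀ + I/τ). A sketch in my own notation (Phi can come from the fourier_features helper sketched earlier); evaluating it for each model order D and normalizing as in the model_posterior sketch reproduces an Occam's Hill like the one in Figure 2.

```python
import numpy as np

def log_evidence_linear(y, Phi, w_var, tau):
    """log P(y|M) for y = Phi w + eps, w ~ N(0, diag(w_var)), eps ~ N(0, I/tau)."""
    N = len(y)
    Ky = Phi @ np.diag(w_var) @ Phi.T + np.eye(N) / tau   # marginal covariance of y
    sign, logdet = np.linalg.slogdet(Ky)
    quad = y @ np.linalg.solve(Ky, y)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
```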


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('occam', 0.468), ('razor', 0.383), ('evidence', 0.2), ('models', 0.17), ('limit', 0.121), ('carl', 0.118), ('discourages', 0.118), ('dtu', 0.118), ('imm', 0.118), ('miiy', 0.118), ('ought', 0.118), ('overcomplex', 0.118), ('cd', 0.103), ('rasmussen', 0.101), ('complexity', 0.098), ('fourier', 0.095), ('views', 0.095), ('bayesian', 0.095), ('prior', 0.094), ('zoubin', 0.092), ('scalar', 0.086), ('denmark', 0.085), ('parameters', 0.085), ('model', 0.083), ('priors', 0.081), ('view', 0.077), ('ylx', 0.076), ('adjusting', 0.076), ('neal', 0.076), ('ucl', 0.076), ('ex', 0.074), ('generate', 0.074), ('dk', 0.072), ('order', 0.071), ('selection', 0.071), ('gaussian', 0.071), ('put', 0.069), ('numbers', 0.067), ('infinite', 0.067), ('orders', 0.066), ('select', 0.064), ('panel', 0.063), ('unlikely', 0.063), ('predictions', 0.061), ('rise', 0.059), ('really', 0.059), ('perspective', 0.059), ('weights', 0.058), ('wt', 0.057), ('dx', 0.057), ('gatsby', 0.056), ('exp', 0.053), ('units', 0.053), ('precision', 0.052), ('regulated', 0.051), ('pursue', 0.051), ('favours', 0.051), ('reconciled', 0.051), ('yim', 0.051), ('hidden', 0.049), ('sometimes', 0.046), ('machinery', 0.046), ('big', 0.046), ('ware', 0.046), ('embodied', 0.046), ('targeted', 0.046), ('lyngby', 0.046), ('spiegelhalter', 0.046), ('stands', 0.046), ('conflicting', 0.046), ('precisions', 0.046), ('inference', 0.045), ('scale', 0.045), ('uk', 0.044), ('basis', 0.043), ('iw', 0.043), ('beliefs', 0.043), ('bd', 0.043), ('normalize', 0.043), ('mcmc', 0.043), ('smith', 0.043), ('splines', 0.043), ('thinking', 0.043), ('implement', 0.042), ('london', 0.041), ('proportional', 0.041), ('likelihood', 0.041), ('treated', 0.04), ('played', 0.04), ('roles', 0.04), ('hyperparameters', 0.04), ('sophisticated', 0.04), ('influential', 0.04), ('peaked', 0.04), ('embodies', 0.04), ('ff', 0.04), ('modest', 0.04), ('central', 0.039), ('want', 0.039), ('complex', 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 92 nips-2000-Occam's Razor

Author: Carl Edward Rasmussen, Zoubin Ghahramani

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

2 0.26863518 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

Author: Ilya Nemenman, William Bialek

Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory ('smoothness scale') can be determined self-consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space, the 'Occam factor', that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if N i.i.d. samples {x_i}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by P[Q(x)|{x_i}] = P[Q(x)] ∏_{i=1}^{N} Q(x_i) / ∫[dQ(x)] P[Q(x)] ∏_{i=1}^{N} Q(x_i), (1) where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least square estimator is then Q_est(x|{x_i}) = ⟨Q(x) Q(x_1) Q(x_2) ... Q(x_N)⟩^(0) / ⟨Q(x_1) Q(x_2) ... Q(x_N)⟩^(0), (2) where ⟨...⟩^(0) means averaging with respect to the prior. Since Q(x) ≥ 0, it is convenient to define an unconstrained field φ(x), Q(x) = (1/ℓ_0) exp[-φ(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning.
We want the prior P[φ] to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. For x in one dimension, a minimal choice is P[φ(x)] = (1/Z) exp[-(ℓ^(2η-1)/2) ∫ dx (∂^η φ/∂x^η)²] δ[(1/ℓ_0) ∫ dx e^(-φ(x)) - 1], (3) where η > 1/2, Z is the normalization constant, and the δ-function enforces normalization of Q. We refer to ℓ and η as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large N and η = 1, expressing ⟨∏_{i=1}^{N} Q(x_i)⟩^(0) through an effective action S_eff [Eqs. (4)-(6)] evaluated at the classical solution φ_cl(x), where φ_cl is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of φ_cl converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of fluctuations ψ(x) = φ(x) - [-log ℓ_0 P(x)] falls off as ~ 1/√(ℓ N P(x)). Finally, it was speculated that if the actual ℓ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale ℓ*. At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. In the semi-classical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box 0 ≤ x ≤ L with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10^-5). The independent variable x ∈ [0, 1] is discretized in equal steps [10^4 for Figs. (1.a-2.b), and 10^5 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of φ_cl(x) is δx, given by Eq. (7) [5].

3 0.13307229 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

4 0.1302284 35 nips-2000-Computing with Finite and Infinite Networks

Author: Ole Winther

Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples on the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.

5 0.10649859 27 nips-2000-Automatic Choice of Dimensionality for PCA

Author: Thomas P. Minka

Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.

6 0.10510831 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

7 0.093819439 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

8 0.093308382 122 nips-2000-Sparse Representation for Gaussian Process Models

9 0.082161747 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

10 0.080007613 133 nips-2000-The Kernel Gibbs Sampler

11 0.071537934 114 nips-2000-Second Order Approximations for Probability Models

12 0.067618549 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

13 0.064448163 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

14 0.063650832 85 nips-2000-Mixtures of Gaussian Processes

15 0.06355647 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

16 0.063383497 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

17 0.06241934 75 nips-2000-Large Scale Bayes Point Machines

18 0.061381258 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

19 0.059201851 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

20 0.057346411 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.234), (1, 0.007), (2, 0.086), (3, -0.023), (4, 0.175), (5, 0.044), (6, -0.049), (7, 0.078), (8, 0.036), (9, -0.262), (10, 0.073), (11, 0.117), (12, 0.11), (13, -0.062), (14, 0.132), (15, 0.163), (16, -0.092), (17, -0.202), (18, -0.038), (19, -0.078), (20, 0.021), (21, 0.029), (22, -0.23), (23, -0.134), (24, 0.015), (25, -0.061), (26, 0.044), (27, -0.028), (28, 0.019), (29, 0.025), (30, -0.097), (31, 0.003), (32, 0.036), (33, 0.027), (34, 0.002), (35, -0.004), (36, 0.025), (37, 0.108), (38, 0.016), (39, 0.173), (40, 0.024), (41, -0.018), (42, -0.06), (43, -0.009), (44, -0.077), (45, -0.027), (46, -0.108), (47, 0.122), (48, -0.024), (49, -0.11)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95229173 92 nips-2000-Occam's Razor

Author: Carl Edward Rasmussen, Zoubin Ghahramani

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

2 0.85884386 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

Author: Ilya Nemenman, William Bialek

Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory ('smoothness scale') can be determined self-consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space, the 'Occam factor', that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if N i.i.d. samples {x_i}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by P[Q(x)|{x_i}] = P[Q(x)] ∏_{i=1}^{N} Q(x_i) / ∫[dQ(x)] P[Q(x)] ∏_{i=1}^{N} Q(x_i), (1) where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least square estimator is then Q_est(x|{x_i}) = ⟨Q(x) Q(x_1) Q(x_2) ... Q(x_N)⟩^(0) / ⟨Q(x_1) Q(x_2) ... Q(x_N)⟩^(0), (2) where ⟨...⟩^(0) means averaging with respect to the prior. Since Q(x) ≥ 0, it is convenient to define an unconstrained field φ(x), Q(x) = (1/ℓ_0) exp[-φ(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning.
We want the prior P[φ] to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. For x in one dimension, a minimal choice is P[φ(x)] = (1/Z) exp[-(ℓ^(2η-1)/2) ∫ dx (∂^η φ/∂x^η)²] δ[(1/ℓ_0) ∫ dx e^(-φ(x)) - 1], (3) where η > 1/2, Z is the normalization constant, and the δ-function enforces normalization of Q. We refer to ℓ and η as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large N and η = 1, expressing ⟨∏_{i=1}^{N} Q(x_i)⟩^(0) through an effective action S_eff [Eqs. (4)-(6)] evaluated at the classical solution φ_cl(x), where φ_cl is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of φ_cl converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of fluctuations ψ(x) = φ(x) - [-log ℓ_0 P(x)] falls off as ~ 1/√(ℓ N P(x)). Finally, it was speculated that if the actual ℓ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale ℓ*. At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. In the semi-classical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box 0 ≤ x ≤ L with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10^-5). The independent variable x ∈ [0, 1] is discretized in equal steps [10^4 for Figs. (1.a-2.b), and 10^5 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of φ_cl(x) is δx, given by Eq. (7) [5].

3 0.61655313 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian

Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncrasies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures, each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, y = θx and y = x^θ, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition.
We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters θ assigns the likelihood f(y|θ) to a given set of observed data y. The full form of the MDL measure for such a model family is given below. MDL = -ln f(y|θ̂) + (k/2) ln(N/2π) + ln ∫ dθ √(det I(θ)), where θ̂ is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size and I(θ) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log likelihood term is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(θ). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2]. Hence in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. 2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which

4 0.5310182 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

Author: Igor V. Cadez, Padhraic Smyth

Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically, we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within first order as a function of model complexity. This general property of

5 0.52661264 27 nips-2000-Automatic Choice of Dimensionality for PCA

Author: Thomas P. Minka

Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.

6 0.48584783 35 nips-2000-Computing with Finite and Infinite Networks

7 0.47002381 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

8 0.45204896 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

9 0.3884477 133 nips-2000-The Kernel Gibbs Sampler

10 0.38154429 85 nips-2000-Mixtures of Gaussian Processes

11 0.36009249 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

12 0.35594314 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping

13 0.34886241 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

14 0.34232324 122 nips-2000-Sparse Representation for Gaussian Process Models

15 0.32389462 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

16 0.31186444 16 nips-2000-Active Inference in Concept Learning

17 0.30358133 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

18 0.30106416 34 nips-2000-Competition and Arbors in Ocular Dominance

19 0.29335257 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

20 0.2832132 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.022), (17, 0.106), (32, 0.029), (33, 0.053), (38, 0.312), (54, 0.013), (55, 0.03), (62, 0.059), (65, 0.012), (67, 0.078), (75, 0.016), (76, 0.076), (79, 0.06), (81, 0.012), (90, 0.047), (97, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86491328 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning

Author: Angelo Arleo, Fabrizio Smeraldi, Stéphane Hug, Wulfram Gerstner

Abstract: We model hippocampal place cells and head-direction cells by combining allothetic (visual) and idiothetic (proprioceptive) stimuli. Visual input, provided by a video camera on a miniature robot, is preprocessed by a set of Gabor filters on 31 nodes of a log-polar retinotopic graph. Unsupervised Hebbian learning is employed to incrementally build a population of localized overlapping place fields. Place cells serve as basis functions for reinforcement learning. Experimental results for goal-oriented navigation of a mobile robot are presented.

same-paper 2 0.79692113 92 nips-2000-Occam's Razor

Author: Carl Edward Rasmussen, Zoubin Ghahramani

Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.

3 0.49038583 87 nips-2000-Modelling Spatial Recall, Mental Imagery and Neglect

Author: Suzanna Becker, Neil Burgess

Abstract: We present a computational model of the neural mechanisms in the parietal and temporal lobes that support spatial navigation, recall of scenes and imagery of the products of recall. Long term representations are stored in the hippocampus, and are associated with local spatial and object-related features in the parahippocampal region. Viewer-centered representations are dynamically generated from long term memory in the parietal part of the model. The model thereby simulates recall and imagery of locations and objects in complex environments. After parietal damage, the model exhibits hemispatial neglect in mental imagery that rotates with the imagined perspective of the observer, as in the famous Milan Square experiment [1]. Our model makes novel predictions for the neural representations in the parahippocampal and parietal regions and for behavior in healthy volunteers and neuropsychological patients.

4 0.488195 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

5 0.48110303 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

Author: Simon Tong, Daphne Koller

Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.

6 0.46991208 43 nips-2000-Dopamine Bonuses

7 0.46881181 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.46841484 122 nips-2000-Sparse Representation for Gaussian Process Models

9 0.46347216 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

10 0.46262568 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

11 0.46023297 8 nips-2000-A New Model of Spatial Representation in Multimodal Brain Areas

12 0.45960826 79 nips-2000-Learning Segmentation by Random Walks

13 0.45819888 146 nips-2000-What Can a Single Neuron Compute?

14 0.45819664 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

15 0.4581157 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

16 0.45796603 60 nips-2000-Gaussianization

17 0.45789886 37 nips-2000-Convergence of Large Margin Separable Linear Classification

18 0.45663315 21 nips-2000-Algorithmic Stability and Generalization Performance

19 0.45612481 27 nips-2000-Automatic Choice of Dimensionality for PCA

20 0.45611155 133 nips-2000-The Kernel Gibbs Sampler