nips nips2000 nips2000-76 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ilya Nemenman, William Bialek
Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory (,smoothness scale') can be determined self consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space-the 'Occam factor' -that discriminates against models with more parameters [1, 2]. This works remarkably welJ for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). 
In particular, there are field theoretic methods for computing the infinite-dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite-dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution?

Following Ref. [5], if N i.i.d. samples {x_i}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by

    P[Q(x)|{x_i}] = P[Q(x)] ∏_{i=1}^N Q(x_i) / ∫ [dQ(x)] P[Q(x)] ∏_{i=1}^N Q(x_i),   (1)

where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least-squares estimator is then

    Q_est(x|{x_i}) = ⟨Q(x) Q(x_1) Q(x_2) ⋯ Q(x_N)⟩^(0) / ⟨Q(x_1) Q(x_2) ⋯ Q(x_N)⟩^(0),   (2)

where ⟨...⟩^(0) means averaging with respect to the prior. Since Q(x) ≥ 0, it is convenient to define an unconstrained field φ(x), Q(x) = (1/ℓ_0) exp[−φ(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice.

The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning. We want the prior P[φ] to make sense as a continuous theory, independent of the discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales.
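As a toy illustration of Eqs. (1)-(2), the posterior-mean estimator can be computed by brute-force Monte Carlo: draw densities from a prior, weight each by the likelihood of the observed points, and average. The sketch below is only a discretized stand-in; a Dirichlet prior over histogram bins replaces the field-theoretic prior P[Q], and the bin count B, sample count S, and data distribution are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Brute-force Monte Carlo version of Eqs. (1)-(2): the optimal estimator
# is the prior average of Q weighted by the likelihood prod_i Q(x_i).
# A Dirichlet prior over B histogram bins stands in for P[Q] (assumption).
B, S, N = 20, 20000, 50
data = rng.beta(2.0, 5.0, size=N)                 # N observations on [0, 1]
bins = np.minimum((data * B).astype(int), B - 1)  # bin index of each x_i

Q = rng.dirichlet(np.full(B, 2.0), size=S)        # S densities Q ~ P[Q]
loglik = np.log(Q[:, bins]).sum(axis=1)           # log prod_i Q(bin(x_i))
w = np.exp(loglik - loglik.max())
w /= w.sum()                                      # posterior weights, Eq. (1)

Q_est = w @ Q                                     # posterior mean, Eq. (2)
```

For the actual field-theoretic prior this direct average is hopeless, which is why the analysis proceeds through the saddle-point expansion instead.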
For x in one dimension, a minimal choice is

    P[φ(x)] = (1/Z) exp[ −(ℓ^{2η−1}/2) ∫ dx (∂^η φ/∂x^η)² ] δ[ (1/ℓ_0) ∫ dx e^{−φ(x)} − 1 ],   (3)

where η > 1/2, Z is the normalization constant, and the δ-function enforces normalization of Q. We refer to ℓ and η as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large N and η = 1:

    ⟨∏_{i=1}^N Q(x_i)⟩^(0) ≈ exp[ −S_eff[φ_cl; {x_i}] ],   (4)

    S_eff = (ℓ/2) ∫ dx (∂_x φ_cl)² + Σ_{i=1}^N φ_cl(x_i) + (1/2) ∫ dx √( N e^{−φ_cl(x)} / (ℓ ℓ_0) ),   (5)

    ℓ ∂_x² φ_cl(x) + (N/ℓ_0) e^{−φ_cl(x)} = Σ_{i=1}^N δ(x − x_i),   (6)

where φ_cl is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square-root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of φ_cl converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of the fluctuations ψ(x) ≡ φ(x) − [−log ℓ_0 P(x)] falls off as ∼ 1/√(ℓ N P(x)). Finally, it was speculated that if the actual ℓ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale ℓ*.

At first glance the theory looks almost exactly like a Gaussian Process [4]. This impression is produced by the Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them.
In the semiclassical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely nontrivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature.

The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box 0 ≤ x ≤ L with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10^{−5}). The independent variable x ∈ [0, 1] is discretized in equal steps [10^4 for Figs. (1.a-2.b), and 10^5 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of φ_cl(x) is [5]

    δx ∼ √( ℓ / (N P(x)) ),   (7)

which can be rather small for large N or small ℓ.
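The relaxation step can be sketched as a damped Newton iteration on the discretized classical equation. Everything below (grid size, ℓ, the sample density, and the choices η = 1 and ℓ_0 = 1) is an illustrative assumption, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Damped Newton ('relaxation') solver for the discretized classical
# equation (assuming eta = 1, ell_0 = 1, periodic grid):
#     ell * phi''(x) + N * exp(-phi(x)) = sum_i delta(x - x_i)
n, ell, N = 200, 0.2, 100
h = 1.0 / n
data = rng.beta(2.0, 5.0, size=N)                 # illustrative sample
counts = np.bincount(np.minimum((data / h).astype(int), n - 1), minlength=n)
d = counts / h                                    # each delta -> spike 1/h

eye = np.eye(n)
D2 = (np.roll(eye, 1, axis=0) + np.roll(eye, -1, axis=0) - 2 * eye) / h**2

def resid(p):
    return ell * D2 @ p + N * np.exp(-p) - d

phi = np.zeros(n)
res = np.linalg.norm(resid(phi))
for _ in range(100):
    if res < 1e-9:
        break
    F = resid(phi)
    step = np.linalg.solve(ell * D2 - N * np.diag(np.exp(-phi)), F)
    alpha = 1.0                  # backtrack: the Newton step is a descent
    while alpha > 1e-6:          # direction for ||F||^2
        if np.linalg.norm(resid(phi - alpha * step)) < res:
            break
        alpha /= 2
    phi -= alpha * step
    res = np.linalg.norm(resid(phi))

Q_cl = np.exp(-phi)              # classical density estimate
```

Integrating Eq. (6) over the period shows that the converged solution is exactly normalized, so `h * Q_cl.sum()` returns 1 to within the solver tolerance.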
Learning continuous distributions: Simulations with field theoretic priors
Ilya Nemenman(1,2) and William Bialek(2)
(1) Department of Physics, Princeton University, Princeton, New Jersey 08544
(2) NEC Research Institute, 4 Independence Way, Princeton, New Jersey 08540
nemenman@research.com
Since the theory is insensitive at short scales, we can generate random probability densities chosen from the prior by replacing φ with its Fourier series and truncating the latter at some sufficiently high wavenumber k_c (k_c = 1000 for the figures). The prior of Eq. (3) then enforces the amplitude of the k'th mode to be distributed a priori normally with the standard deviation

    σ_k = √(2/L) · ℓ^{1/2−η} · (L/2πk)^η.   (8)

Coded in such a way, the simulations are extremely computationally intensive. Therefore, the Monte Carlo averagings given here are only over 500 runs, and the fluctuation determinants are calculated according to Eq. (5).
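Drawing a random density from the prior via the truncated Fourier series can be sketched as follows. The mode amplitudes use the reconstructed Eq. (8), and L, the grid size, ℓ, η and k_c are illustrative values, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw one random density from the prior of Eq. (3): truncate the
# Fourier series of phi at wavenumber k_c, with mode amplitudes normal
# with the standard deviation of the reconstructed Eq. (8).
# L, n, ell, eta and kc are illustrative assumptions.
L, n, ell, eta, kc = 1.0, 1000, 0.1, 1.0, 200
x = np.arange(n) * L / n
k = np.arange(1, kc + 1)
sigma = np.sqrt(2.0 / L) * ell ** (0.5 - eta) * (L / (2 * np.pi * k)) ** eta

a = rng.normal(0.0, sigma)                  # cosine amplitudes
b = rng.normal(0.0, sigma)                  # sine amplitudes
phase = 2 * np.pi * np.outer(k, x) / L
phi = a @ np.cos(phase) + b @ np.sin(phase)

Q = np.exp(-phi)
Q /= Q.sum() * L / n                        # the delta-function constraint
```

The final line imposes the normalization that the δ-function in Eq. (3) enforces exactly in the continuum theory.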
We see that singularities and overfitting are absent even for N as low as 10. Moreover, the approach of Q_cl(x) to the actual distribution P(x) is remarkably fast: for N = 10, they are similar; for N = 1000, very close; for N = 100000, one needs to look carefully to see the difference between the two. To quantify this similarity of distributions, we compute the Kullback-Leibler divergence D_KL(P||Q_est) between the true distribution P(x) and its estimate Q_est(x), and then average over the realizations of the data points and the true distribution. As discussed in [11], this learning curve Λ(N) measures the (average) excess cost incurred in coding the (N+1)'st data point because of the finiteness of the data sample, and thus can be called the 'universal learning curve'.
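A minimal version of this comparison looks like the sketch below; a naive Gaussian-kernel estimator stands in for the QFT estimate (the paper's estimator is Eq. (2), not a kernel method), and the grid, kernel width, and true density are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Average KL divergence D_KL(P || Q_est) over data realizations, as a
# function of N.  A kernel estimator stands in for the QFT estimate
# (assumption); the grid, kernel width and true P are arbitrary.
n = 400
h = 1.0 / n
x = (np.arange(n) + 0.5) * h
p = np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)
p /= p.sum() * h                            # true density P(x) on [0, 1]

def estimate(sample, w=0.05):               # kernel stand-in for Q_est
    q = np.exp(-0.5 * ((x[:, None] - sample[None, :]) / w) ** 2).sum(axis=1)
    q = q + 1e-12                           # keep the estimate positive
    return q / (q.sum() * h)

def kl(p, q):
    return h * np.sum(p * np.log(p / q))

lam = {}                                    # the learning curve Lambda(N)
for N in (10, 100, 1000):
    runs = [kl(p, estimate(rng.normal(0.5, 0.1, N) % 1.0)) for _ in range(20)]
    lam[N] = float(np.mean(runs))
```

As expected of a learning curve, the averaged divergence decreases as N grows.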
If the inference algorithm uses all of the information contained in the data that is relevant for learning ('predictive information' [11]), then [5, 9, 11, 10]

    Λ(N) ∼ (L/ℓ)^{1/2η} N^{1/2η − 1}.   (9)

We test this prediction against the learning curves in the actual simulations. One sees that the exponents are extremely close to the expected 1/2, and the ratios of the prefactors are within the errors of the predicted scaling ∼ 1/√ℓ. All of this means that the proposed algorithm for finding densities not only works, but is at most a constant factor away from being optimal in using the predictive information of the sample set.
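The exponent check amounts to a power-law fit of the learning curve on log-log axes; a sketch follows, with synthetic Λ(N) values (an exact power law plus noise) standing in for the measured Monte Carlo curves.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fit the learning-curve exponent on log-log axes and compare with the
# 1/(2*eta) - 1 prediction of Eq. (9).  The Lambda values below are
# synthetic stand-ins for the Monte Carlo learning curves.
eta = 1.0
N = np.array([10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0])
lam = 2.0 * N ** (1.0 / (2 * eta) - 1.0) * np.exp(rng.normal(0.0, 0.02, N.size))

slope, log_prefactor = np.polyfit(np.log(N), np.log(lam), 1)
```

For η = 1 the fitted slope comes out close to the predicted value of −1/2.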
Next we investigate how one's choice of the prior influences learning. We first stress that there is no such thing as a wrong prior. If one admits the possibility of it being wrong, then

[Figure, panels (a) and (b): fits for 10, 1000, and 100000 samples, compared with the actual distribution.]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
2 0.26863518 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
3 0.13914445 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncracies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures; each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. 
The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, y = θx and y = x^θ, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1].
It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters θ assigns the likelihood f(y|θ) to a given set of observed data y. The full form of the MDL measure for such a model family is given below:

    MDL = −ln f(y|θ̂) + (k/2) ln(N/2π) + ln ∫ dθ √(det I(θ)),

where θ̂ is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size and I(θ) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log likelihood term is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(θ). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2]. Hence, in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection.
2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which
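For a concrete feel of the MDL expression in the abstract above, it can be evaluated in closed form for a one-parameter Bernoulli family, where the per-sample Fisher information is I(θ) = 1/[θ(1−θ)] so that ∫dθ √(det I(θ)) = B(1/2, 1/2) = π; the counts below are arbitrary choices.

```python
import math

# Closed-form MDL for a one-parameter Bernoulli family: k = 1 and the
# integral over theta of sqrt(det I(theta)) equals pi.
# The counts (N flips, 'heads' successes) are arbitrary assumptions.
N, heads = 100, 60
theta_hat = heads / N                        # maximum-likelihood estimate
neg_loglik = -(heads * math.log(theta_hat)
               + (N - heads) * math.log(1.0 - theta_hat))
geometric_complexity = 0.5 * math.log(N / (2 * math.pi)) + math.log(math.pi)
mdl = neg_loglik + geometric_complexity
```

The geometric complexity term grows only logarithmically with the sample size N, while the lack-of-fit term grows linearly, which is why complexity matters most for small samples.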
4 0.12788956 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models.
5 0.10844705 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with a number of training examples of the order of the number of adjustable weights. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
6 0.09403237 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
7 0.091441073 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
8 0.088841401 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
9 0.084688626 27 nips-2000-Automatic Choice of Dimensionality for PCA
10 0.083005868 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
11 0.080139771 122 nips-2000-Sparse Representation for Gaussian Process Models
12 0.078199729 146 nips-2000-What Can a Single Neuron Compute?
13 0.074773699 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
14 0.071620323 120 nips-2000-Sparse Greedy Gaussian Process Regression
15 0.069564372 13 nips-2000-A Tighter Bound for Graphical Models
16 0.06823878 114 nips-2000-Second Order Approximations for Probability Models
17 0.067552783 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice
18 0.066462763 82 nips-2000-Learning and Tracking Cyclic Human Motion
19 0.065406911 74 nips-2000-Kernel Expansions with Unlabeled Examples
20 0.064822681 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
topicId topicWeight
[(0, 0.259), (1, 0.007), (2, 0.058), (3, -0.03), (4, 0.146), (5, 0.036), (6, -0.038), (7, 0.101), (8, -0.038), (9, -0.233), (10, -0.045), (11, 0.195), (12, 0.079), (13, -0.017), (14, 0.103), (15, 0.14), (16, -0.111), (17, -0.279), (18, -0.076), (19, -0.113), (20, 0.014), (21, 0.07), (22, -0.28), (23, -0.075), (24, 0.017), (25, -0.082), (26, 0.035), (27, -0.045), (28, -0.008), (29, -0.089), (30, -0.062), (31, 0.024), (32, 0.02), (33, 0.022), (34, 0.068), (35, -0.012), (36, -0.012), (37, 0.145), (38, -0.022), (39, 0.136), (40, -0.038), (41, -0.081), (42, -0.082), (43, -0.04), (44, -0.013), (45, 0.052), (46, -0.082), (47, 0.019), (48, -0.017), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.96013367 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
Author: Ilya Nemenman, William Bialek
Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory (the 'smoothness scale') can be determined self-consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space (the 'Occam factor') that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). 
In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if N i.i.d. samples {x_i}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by P[Q(x)|{x_i}] = P[Q(x)] ∏_{i=1}^N Q(x_i) / ∫[dQ(x)] P[Q(x)] ∏_{i=1}^N Q(x_i), (1) where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least-square estimator is then Q_est(x|{x_i}) = ⟨Q(x) Q(x_1) Q(x_2) ... Q(x_N)⟩^(0) / ⟨Q(x_1) Q(x_2) ... Q(x_N)⟩^(0), (2) where ⟨...⟩^(0) means averaging with respect to the prior. Since Q(x) ≥ 0, it is convenient to define an unconstrained field φ(x), Q(x) = (1/ℓ_0) exp[−φ(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning. We want the prior P[φ] to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. 
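As a crude, self-contained illustration of the prior averages in Eqs. (1) and (2), the estimator can be approximated by importance sampling: draw candidate densities from a smoothness prior on a grid, weight each draw by its likelihood on the data, and average. Everything below is an assumption for illustration, not the paper's method: a Fourier-space Gaussian random field stands in for the prior, the normalization constraint is imposed by rescaling rather than by a δ-function, and the grid size, smoothness scale `ell`, and number of draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_smooth_density(n_grid, ell, n_modes=20):
    """Draw one random density: a Gaussian random field phi with
    spectrum ~ 1/(ell * k^2) (a stand-in for the smoothness prior),
    mapped to Q = exp(-phi) and rescaled to integrate to 1."""
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    phi = np.zeros(n_grid)
    for m in range(1, n_modes + 1):
        k = 2.0 * np.pi * m
        std = 1.0 / np.sqrt(ell * k**2)
        phi += std * (rng.normal() * np.cos(k * x) + rng.normal() * np.sin(k * x))
    Q = np.exp(-phi)
    return Q / (Q.sum() / n_grid)          # enforce integral of Q dx = 1

def posterior_mean_density(samples, n_grid=200, ell=0.05, n_draws=500):
    """Eq. (2) by Monte Carlo: prior draws weighted by their likelihood."""
    dx = 1.0 / n_grid
    idx = np.minimum((np.asarray(samples) / dx).astype(int), n_grid - 1)
    Qs = np.array([sample_smooth_density(n_grid, ell) for _ in range(n_draws)])
    log_w = np.sum(np.log(Qs[:, idx] + 1e-300), axis=1)   # log prod_i Q(x_i)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w @ Qs                           # weighted average of densities
```

With data clustered around x = 0.5, the weighted average shifts probability mass toward the cluster, while normalization is preserved exactly by construction.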
For x in one dimension, a minimal choice is P[φ(x)] = (1/Z) exp[ −(ℓ^(2η−1)/2) ∫ dx (∂^η φ/∂x^η)² ] δ[ (1/ℓ_0) ∫ dx e^(−φ(x)) − 1 ], (3) where η > 1/2, Z is the normalization constant, and the δ-function enforces normalization of Q. We refer to ℓ and η as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large N and η = 1: ⟨∏_{i=1}^N Q(x_i)⟩^(0) ≈ exp[−S_eff(φ_cl)], (4) S_eff = (ℓ/2) ∫ dx (∂_x φ_cl)² + Σ_{i=1}^N φ_cl(x_i) + (1/2) ∫ dx √(N Q_cl(x)/ℓ), (5) ℓ ∂²_x φ_cl(x) = Σ_{i=1}^N δ(x − x_i) − N Q_cl(x), (6) where φ_cl is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of φ_cl converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of fluctuations ψ(x) ≡ φ(x) − [−log ℓ_0 P(x)] falls off as ∼ 1/√(ℓN P(x)). Finally, it was speculated that if the actual ℓ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale ℓ*. At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by the Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. 
In the semiclassical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box 0 ≤ x ≤ L with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10^-5). The independent variable x ∈ [0,1] is discretized in equal steps [10^4 for Figs. (1.a-2.b), and 10^5 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of φ_cl(x) is [5] δx ∼ … (7)
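The relaxation scheme sketched above can be written in a few lines of code. The version below is a hedged sketch, not the paper's implementation: it assumes η = 1, a periodic grid on [0, 1], and a saddle-point equation of the reconstructed form ℓ∂²_x φ_cl = Σ_i δ(x − x_i) − N Q_cl; it uses plain damped descent on the classical action instead of a Newton iteration, and the grid size, smoothness scale, and step size are illustrative.

```python
import numpy as np

def qft_density_estimate(samples, ell=0.1, n_grid=200, n_iter=20000, lr=1e-4):
    """Relax phi toward the classical (saddle-point) solution on a
    periodic grid over [0, 1] and return the density Q = exp(-phi),
    renormalized to integrate to 1 at every step."""
    dx = 1.0 / n_grid
    n = len(samples)
    # empirical sum of delta functions, binned onto the grid
    hist, _ = np.histogram(samples, bins=n_grid, range=(0.0, 1.0))
    rho = hist / dx
    phi = np.zeros(n_grid)
    for _ in range(n_iter):
        lap = (np.roll(phi, 1) - 2.0 * phi + np.roll(phi, -1)) / dx**2
        Q = np.exp(-phi)
        Q /= Q.sum() * dx
        # descend -dS/dphi = ell*phi'' - rho + n*Q  (assumed saddle form)
        phi += lr * (ell * lap - rho + n * Q)
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    Q = np.exp(-phi)
    return x, Q / (Q.sum() * dx)
```

The step size must satisfy lr·ℓ/dx² ≲ 1/2 for the explicit Laplacian update to remain stable, which is one reason the much finer grids quoted in the text (10^4-10^5 points) call for a proper Newton/relaxation solver rather than this naive descent.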
2 0.84669083 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
3 0.69421363 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncrasies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures, each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. 
The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, y = θx and y = x^θ, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. 
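This asymmetry between one-parameter families is easy to check numerically. The toy experiment below (data, parameter grid, and ranges are all illustrative) fits a linear model y = θx and a power model y = x^θ to data generated by the other: each model fits itself essentially perfectly, but the power model can never match linear data near x = 1 (where x^θ ≈ 1 for every θ), so equal parameter counts do not imply equal flexibility.

```python
import numpy as np

def best_fit_sse(model, x, y, thetas):
    """Grid-search the single parameter; return the smallest sum of
    squared errors the model can achieve on (x, y)."""
    return min(np.sum((y - model(x, t))**2) for t in thetas)

x = np.linspace(0.5, 2.0, 20)
thetas = np.linspace(0.05, 5.0, 2000)
lin  = lambda x, t: t * x        # y = theta * x
powr = lambda x, t: x ** t       # y = x ** theta

# noiseless data generated by each model: each fits itself exactly,
# but the cross-fits reveal very different flexibility
y_lin  = lin(x, 1.7)
y_powr = powr(x, 1.7)
sse_powr_on_lin = best_fit_sse(powr, x, y_lin, thetas)
sse_lin_on_powr = best_fit_sse(lin, x, y_powr, thetas)
```

Despite identical parameter counts, the two families cover different regions of data space, which is exactly the functional-form dimension of complexity described above.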
It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters θ assigns the likelihood f(y|θ) to a given set of observed data y. The full form of the MDL measure for such a model family is given below: MDL = −ln f(y|θ̂) + (k/2) ln(N/2π) + ln ∫ dθ √(det I(θ)), where θ̂ is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size and I(θ) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first term, the negative of the maximized log likelihood, is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(θ). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2]. Hence in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. 
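When the Fisher information is available in closed form, all three terms of the MDL criterion are straightforward to evaluate. The sketch below does so for a toy model y = θx with Gaussian noise of known σ (a hypothetical example, not from the paper): the per-sample Fisher information I(θ) = E[x²]/σ² does not depend on θ, so the functional-form integral ∫dθ √I(θ) reduces to the length of the parameter range times √I.

```python
import numpy as np

def mdl_linear(x, y, sigma=1.0, theta_range=(0.0, 5.0)):
    """MDL for y = theta*x + Gaussian noise of known sigma (toy sketch):
    lack of fit + (k/2) ln(N/2pi) + ln integral of sqrt(det I(theta))."""
    n = len(x)
    theta_hat = np.dot(x, y) / np.dot(x, x)          # maximum likelihood
    resid = y - theta_hat * x
    neg_log_lik = (np.sum(resid**2) / (2 * sigma**2)
                   + n * np.log(sigma * np.sqrt(2 * np.pi)))
    k = 1                                            # one free parameter
    dim_term = (k / 2) * np.log(n / (2 * np.pi))
    fisher = np.mean(x**2) / sigma**2                # constant in theta
    a, b = theta_range
    form_term = np.log((b - a) * np.sqrt(fisher))
    return neg_log_lik + dim_term + form_term
```

Widening the allowed parameter range from (0, 5) to (0, 50) increases the geometric complexity term by exactly ln 10 without changing the fit term, which is the parameter-volume ("Occam factor") effect the text describes.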
2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which
4 0.54960632 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
Author: Igor V. Cadez, Padhraic Smyth
Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of
5 0.49588978 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models.
6 0.45643267 27 nips-2000-Automatic Choice of Dimensionality for PCA
7 0.40905893 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
8 0.39708796 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
9 0.3961817 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
10 0.37076986 35 nips-2000-Computing with Finite and Infinite Networks
12 0.33416346 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
13 0.33097485 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
14 0.32973209 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
15 0.3260054 125 nips-2000-Stability and Noise in Biochemical Switches
16 0.32126141 85 nips-2000-Mixtures of Gaussian Processes
17 0.31949919 25 nips-2000-Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators
18 0.31880456 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
19 0.31737426 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice
20 0.30158859 52 nips-2000-Fast Training of Support Vector Classifiers
topicId topicWeight
[(10, 0.012), (17, 0.077), (33, 0.026), (54, 0.017), (55, 0.012), (62, 0.018), (67, 0.665), (75, 0.015), (76, 0.058), (90, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97892147 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
Author: Ilya Nemenman, William Bialek
Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory (the 'smoothness scale') can be determined self-consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space (the 'Occam factor') that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). 
In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if N i.i.d. samples {x_i}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by P[Q(x)|{x_i}] = P[Q(x)] ∏_{i=1}^N Q(x_i) / ∫[dQ(x)] P[Q(x)] ∏_{i=1}^N Q(x_i), (1) where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least-square estimator is then Q_est(x|{x_i}) = ⟨Q(x) Q(x_1) Q(x_2) ... Q(x_N)⟩^(0) / ⟨Q(x_1) Q(x_2) ... Q(x_N)⟩^(0), (2) where ⟨...⟩^(0) means averaging with respect to the prior. Since Q(x) ≥ 0, it is convenient to define an unconstrained field φ(x), Q(x) = (1/ℓ_0) exp[−φ(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning. We want the prior P[φ] to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. 
For x in one dimension, a minimal choice is P[φ(x)] = (1/Z) exp[ −(ℓ^(2η−1)/2) ∫ dx (∂^η φ/∂x^η)² ] δ[ (1/ℓ_0) ∫ dx e^(−φ(x)) − 1 ], (3) where η > 1/2, Z is the normalization constant, and the δ-function enforces normalization of Q. We refer to ℓ and η as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large N and η = 1: ⟨∏_{i=1}^N Q(x_i)⟩^(0) ≈ exp[−S_eff(φ_cl)], (4) S_eff = (ℓ/2) ∫ dx (∂_x φ_cl)² + Σ_{i=1}^N φ_cl(x_i) + (1/2) ∫ dx √(N Q_cl(x)/ℓ), (5) ℓ ∂²_x φ_cl(x) = Σ_{i=1}^N δ(x − x_i) − N Q_cl(x), (6) where φ_cl is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of φ_cl converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of fluctuations ψ(x) ≡ φ(x) − [−log ℓ_0 P(x)] falls off as ∼ 1/√(ℓN P(x)). Finally, it was speculated that if the actual ℓ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale ℓ*. At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by the Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. 
In the semiclassical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box 0 ≤ x ≤ L with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10^-5). The independent variable x ∈ [0,1] is discretized in equal steps [10^4 for Figs. (1.a-2.b), and 10^5 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of φ_cl(x) is [5] δx ∼ … (7)
2 0.95740128 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations
Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee
Abstract: The principle of maximizing mutual information is applied to learning overcomplete and recurrent representations. The underlying model consists of a network of input units driving a larger number of output units with recurrent interactions. In the limit of zero noise, the network is deterministic and the mutual information can be related to the entropy of the output units. Maximizing this entropy with respect to both the feedforward connections as well as the recurrent interactions results in simple learning rules for both sets of parameters. The conventional independent components (ICA) learning algorithm can be recovered as a special case where there is an equal number of output units and no recurrent connections. The application of these new learning rules is illustrated on a simple two-dimensional input example.
3 0.81167531 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.69709259 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
Author: Gal Chechik, Naftali Tishby
Abstract: The paradigm of Hebbian learning has recently received a novel interpretation with the discovery of synaptic plasticity that depends on the relative timing of pre and post synaptic spikes. This paper derives a temporally dependent learning rule from the basic principle of mutual information maximization and studies its relation to the experimentally observed plasticity. We find that a supervised spike-dependent learning rule sharing similar structure with the experimentally observed plasticity increases mutual information to a stable near optimal level. Moreover, the analysis reveals how the temporal structure of time-dependent learning rules is determined by the temporal filter applied by neurons over their inputs. These results suggest experimental predictions as to the dependency of the learning rule on neuronal biophysical parameters.
5 0.62015575 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
Author: Sumio Watanabe
Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and Gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed then the stochastic complexity has the same form as BIC. It is useful for model selection, but not for generalization.
6 0.53685188 146 nips-2000-What Can a Single Neuron Compute?
7 0.52586281 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
8 0.52431643 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
9 0.52198064 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
10 0.5166164 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
11 0.51615417 44 nips-2000-Efficient Learning of Linear Perceptrons
12 0.50776279 21 nips-2000-Algorithmic Stability and Generalization Performance
13 0.50481504 124 nips-2000-Spike-Timing-Dependent Learning for Oscillatory Networks
14 0.49932334 22 nips-2000-Algorithms for Non-negative Matrix Factorization
15 0.49335322 125 nips-2000-Stability and Noise in Biochemical Switches
16 0.49040684 35 nips-2000-Computing with Finite and Infinite Networks
17 0.48880666 79 nips-2000-Learning Segmentation by Random Walks
18 0.48280367 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
19 0.47726643 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
20 0.47237474 111 nips-2000-Regularized Winnow Methods