nips nips2000 nips2000-139 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncrasies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures, each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, $y = \theta x$ and $y = x^\theta$, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition.
We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family $f$ with parameters $\theta$ assigns the likelihood $f(y|\theta)$ to a given set of observed data $y$. The full form of the MDL measure for such a model family is given below. $\mathrm{MDL} = -\ln f(y|\hat{\theta}) + \frac{k}{2}\ln\left(\frac{N}{2\pi}\right) + \ln \int d\theta\, \sqrt{\det I(\theta)}$, where $\hat{\theta}$ is the parameter that maximizes the likelihood, $k$ is the number of parameters in the model, $N$ is the sample size and $I(\theta)$ is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first term, the maximized log likelihood, is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through $I(\theta)$. We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2]. Hence in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. 2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which "similar" distributions are mapped to "nearby" points.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract How should we decide among competing explanations of a cognitive process given limited observations? [sent-9, score-0.153]
2 The problem of model selection is at the heart of progress in cognitive science. [sent-10, score-0.285]
3 In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. [sent-11, score-0.103]
4 We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. [sent-12, score-0.328]
5 Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. [sent-13, score-0.122]
6 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. [sent-14, score-0.244]
7 A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. [sent-15, score-0.211]
8 This enterprise of model selection is challenging because of the competing goals that must be satisfied. [sent-16, score-0.25]
9 Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. [sent-17, score-0.08]
10 However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncrasies in the particular data set (i.e., noise) over and above the underlying regularities of interest. [sent-18, score-0.148]
11 Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. [sent-21, score-0.168]
12 As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. [sent-22, score-0.11]
13 Rather, the model contains multiple structures, each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. [sent-24, score-0.173]
14 Only when one of these patterns occurs will the model fit the data well. [sent-26, score-0.138]
15 The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. [sent-27, score-0.552]
16 To achieve this goal, the selection method must be sensitive to the complexity of a model. [sent-28, score-0.24]
17 There are at least two independent dimensions of model complexity. [sent-29, score-0.11]
18 They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. [sent-30, score-0.272]
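To make the functional-form dimension concrete, the following minimal Python sketch (an illustration added here, not part of the original paper) fits the two one-parameter models from the abstract's example, $y = \theta x$ and $y = x^\theta$, to many random data patterns by least squares; although each model has exactly one free parameter, their average best-fit errors differ, reflecting the different flexibility of the two functional forms.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_fit_sse(model, x, y):
    """Best least-squares error of a one-parameter model on data (x, y)."""
    sse = lambda th: np.sum((y - model(x, th)) ** 2)
    return minimize_scalar(sse, bounds=(0.01, 10.0), method="bounded").fun

rng = np.random.default_rng(0)
x = np.linspace(0.1, 2.0, 10)
linear = lambda x, th: th * x   # y = theta * x
power = lambda x, th: x ** th   # y = x ** theta
# Average best-fit error over many random data patterns: same number of
# parameters, but the two functional forms are not equally flexible.
errs = [(best_fit_sse(linear, x, y), best_fit_sse(power, x, y))
        for y in rng.uniform(0.0, 2.0, size=(200, x.size))]
print(np.mean(errs, axis=0))
```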
19 The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. [sent-32, score-0.344]
20 The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. [sent-33, score-0.237]
21 A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. [sent-34, score-0.17]
22 To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. [sent-35, score-0.321]
23 In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. [sent-36, score-0.103]
24 We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. [sent-37, score-0.502]
25 Finally, application examples of MDL are presented in two areas of cognitive modeling. [sent-38, score-0.082]
26 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. [sent-40, score-0.203]
27 It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. [sent-42, score-0.516]
28 MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. [sent-43, score-0.106]
29 A model family $f$ with parameters $\theta$ assigns the likelihood $f(y|\theta)$ to a given set of observed data $y$. [sent-44, score-0.232]
30 The full form of the MDL measure for such a model family is $\mathrm{MDL} = -\ln f(y|\hat{\theta}) + \frac{k}{2}\ln\left(\frac{N}{2\pi}\right) + \ln \int d\theta\, \sqrt{\det I(\theta)}$. [sent-45, score-0.177]
31 Here $\hat{\theta}$ is the parameter that maximizes the likelihood, $k$ is the number of parameters in the model, $N$ is the sample size, and $I(\theta)$ is the Fisher information matrix. [sent-48, score-0.109]
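As a worked instance of this criterion (an illustration added here, not from the paper), consider the one-parameter Bernoulli model, for which the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$ and the geometric integral $\int_0^1 \sqrt{I(\theta)}\, d\theta = \pi$ has a closed form, so the whole criterion can be evaluated directly.

```python
import numpy as np
from scipy.integrate import quad

def mdl_bernoulli(y):
    r"""MDL = -ln f(y|theta_hat) + (k/2) ln(N/(2 pi)) + ln \int sqrt(det I) dtheta
    for the one-parameter Bernoulli model (k = 1)."""
    y = np.asarray(y)
    N, s = len(y), int(y.sum())
    theta_hat = s / N  # maximum-likelihood estimate
    eps = 1e-12        # guard against log(0) at the boundary
    neg_log_lik = -(s * np.log(theta_hat + eps)
                    + (N - s) * np.log(1 - theta_hat + eps))
    dim_term = 0.5 * np.log(N / (2 * np.pi))  # k = 1 free parameter
    # Riemannian volume term: integral of sqrt(I(theta)) over [0, 1] equals pi.
    vol, _ = quad(lambda t: 1.0 / np.sqrt(t * (1.0 - t)), 0.0, 1.0)
    return neg_log_lik + dim_term + np.log(vol)

rng = np.random.default_rng(0)
print(mdl_bernoulli(rng.binomial(1, 0.7, size=100)))
```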
32 In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. [sent-50, score-0.231]
33 The first term, the maximized log likelihood, is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. [sent-53, score-0.228]
34 In particular, the third term captures the effects of complexity due to functional form, reflected through $I(\theta)$. [sent-54, score-0.217]
35 We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. [sent-55, score-0.348]
36 MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2]. [sent-56, score-0.163]
37 In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. [sent-58, score-0.708]
38 2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. [sent-59, score-0.615]
39 Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which "similar" distributions are mapped to "nearby" points. [sent-60, score-0.187]
40 The infinitesimal distance between points separated by the infinitesimal parameter differences $d\theta_i$ is given by $ds^2 = \sum_{i,j=1}^{k} g_{ij}(\theta)\, d\theta_i\, d\theta_j$. [sent-61, score-0.139]
41 The Fisher information $I_{ij}(\theta)$ is the natural metric on a manifold of distributions in the context of statistical inference [4]. [sent-69, score-0.18]
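A quick numerical check of this identification (an added illustration, not the paper's): the Fisher information equals the variance of the score $\partial_\theta \ln f(y|\theta)$, so it can be estimated by Monte Carlo and compared against the analytic metric; for a single Bernoulli observation, $I(\theta) = 1/(\theta(1-\theta))$.

```python
import numpy as np

def fisher_info_mc(theta, n_samples=200_000, seed=1):
    """Monte Carlo estimate of I(theta) = E[(d/dtheta log f(y|theta))^2]
    for one Bernoulli observation."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, theta, size=n_samples)
    score = y / theta - (1 - y) / (1.0 - theta)  # derivative of log likelihood
    return float(np.mean(score ** 2))

theta = 0.3
print(fisher_info_mc(theta))        # roughly 4.76
print(1.0 / (theta * (1 - theta)))  # analytic value: 4.7619...
```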
42 We argue that the MDL measure of model fitness has an attractive interpretation in such a geometric context. [sent-70, score-0.371]
43 The first term in MDL estimates the accuracy of the model, since the likelihood $f(y|\hat{\theta})$ measures the ability of the model to fit the observed data. [sent-71, score-0.286]
44 The second and third terms are supposed to penalize model complexity; we will show that they have interesting geometric interpretations. [sent-72, score-0.316]
45 Given the metric $g_{ij} = I_{ij}$ on the space of parameters, the infinitesimal volume element on the parameter manifold is $dV = \sqrt{\det I(\theta)}\, \prod_{i=1}^{k} d\theta_i$. [sent-73, score-0.273]
46 The Riemannian volume of the parameter manifold is obtained by integrating $dV$ over the space of parameters: $V_M = \int dV = \int d\theta\, \sqrt{\det I(\theta)}$. [sent-76, score-0.134]
47 In other words, the third term in MDL penalizes models that occupy a large volume in the space of distributions. [sent-78, score-0.19]
48 In fact, the volume measure $V_M$ is related to the number of "distinguishable" probability distributions indexed by the model M. [sent-79, score-0.278]
49 Because of the way the model family is embedded in the space of distributions, two different parameter values can index very similar distributions. [sent-80, score-0.191]
50 If complexity is related to volumes occupied by model manifolds, the measure of volume should count only different, or distinguishable, distributions, and not the artificial coordinate volume. [sent-81, score-0.29]
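The cutoff issue raised in footnote 2 below can be made concrete with a small sketch (an added illustration with arbitrary, hypothetical parameter bounds): for a Gaussian model with mean $\mu$ and standard deviation $\sigma$, $\det I(\mu, \sigma) = 2/\sigma^4$, and $V_M = \int \sqrt{\det I}\, d\mu\, d\sigma$ diverges unless the parameter ranges are cut off.

```python
import numpy as np
from scipy.integrate import dblquad

# sqrt(det I) for a Gaussian N(mu, sigma^2): I = diag(1/sigma^2, 2/sigma^2),
# so sqrt(det I) = sqrt(2) / sigma^2.
sqrt_det_I = lambda sigma, mu: np.sqrt(2.0) / sigma**2

# The integral diverges over the full parameter space; cut the ranges off
# (illustrative bounds): mu in [-10, 10], sigma in [0.1, 10].
V_M, _ = dblquad(sqrt_det_I, -10.0, 10.0, 0.1, 10.0)
print(V_M)  # 20 * sqrt(2) * (1/0.1 - 1/10), approximately 280.0
```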
51 While the third term in MDL measures the total volume of distributions a model can describe, the second term relates to the number of model distributions that lie close to the truth. [sent-83, score-0.435]
52 To see this, taking a Bayesian perspective on model selection is helpful. [sent-84, score-0.233]
53 Bayesian methods assume that the prior probabilities are the same for all models under consideration and analyze the so-called Bayesian posterior $P_f = \int d\theta\, w(\theta)\, f(y|\theta)$. Lacking prior knowledge, $w$ should be chosen to weight all distinguishable distributions in the family equally. [sent-86, score-0.343]
54 For large sample sizes, the likelihood function $f(y|\theta)$ localizes under general conditions to a multivariate Gaussian centered at the maximum-likelihood parameter $\hat{\theta}$ (see [3,4] and citations therein). [sent-88, score-0.184]
55 (Footnote 1: Roughly speaking, two probability distributions are considered indistinguishable if one is mistaken for the other even in the presence of an infinite amount of data.) [sent-88, score-0.184]
56 (Footnote 2: The parameters of the model are always assumed to be cut off in a manner that ensures $V_M$ is finite.) [sent-91, score-0.11]
57 Performing the integral and taking a log gives the result $-\ln P_f = -\ln f(y|\hat{\theta}) + \ln(V_M / C_M) + O(1/N)$, where $C_M = (2\pi / N)^{k/2}\, h(\hat{\theta})$ and $h(\hat{\theta})$ is a data-dependent factor that goes to 1 for large $N$ when the truth lies within $f$ (see [3,4] for details). [sent-94, score-0.086]
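This approximation can be checked in a case where $P_f$ has a closed form (an added self-check, not from the paper): for the Bernoulli model with the normalized Jeffreys prior $w(\theta) = \sqrt{I(\theta)}/V_M = 1/(\pi\sqrt{\theta(1-\theta)})$, the posterior is a Beta integral, and $-\ln P_f$ should approach $-\ln f(y|\hat{\theta}) + \ln(V_M/C_M)$ with $V_M = \pi$ and $C_M = (2\pi/N)^{1/2}$ as $N$ grows.

```python
import numpy as np
from scipy.special import betaln

def neg_log_Pf_exact(s, N):
    """Exact -ln P_f under the Jeffreys prior: P_f = B(s+1/2, N-s+1/2) / pi."""
    return -(betaln(s + 0.5, N - s + 0.5) - np.log(np.pi))

def neg_log_Pf_approx(s, N):
    """-ln f(y|theta_hat) + ln(V_M / C_M), with V_M = pi, C_M = sqrt(2 pi / N)."""
    th = s / N
    neg_log_lik = -(s * np.log(th) + (N - s) * np.log(1.0 - th))
    return neg_log_lik + np.log(np.pi) - 0.5 * np.log(2 * np.pi / N)

N, s = 200, 140
print(neg_log_Pf_exact(s, N))   # exact value
print(neg_log_Pf_approx(s, N))  # close to it for large N
```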
58 In effect, $C_M$ measures the number of distinguishable distributions within $f$ that lie close to the truth. [sent-96, score-0.232]
59 Using the expressions for $C_M$ and $V_M$, the MDL selection criterion can be written as $\mathrm{MDL} = -\ln f(y|\hat{\theta}) + \ln(V_M / C_M) + \text{terms subleading in } N$. (The subleading terms include the contribution of $h(\hat{\theta})$; see [3,4] regarding its role in Bayesian inference.) [sent-97, score-0.126]
60 The geometric meaning of the complexity penalty in MDL now becomes clear: models that occupy a relatively large volume distant from the truth are penalized. [sent-98, score-0.586]
61 Models that contain a relatively large fraction of distributions lying close to the truth are preferred. [sent-99, score-0.163]
62 Therefore, we refer to the last two terms in MDL as geometric complexity. [sent-100, score-0.21]
63 It is also illuminating to collect terms in MDL as $\mathrm{MDL} = -\ln\left[\frac{f(y|\hat{\theta})}{V_M/C_M}\right] = -\ln(\text{``normalized maximized likelihood''})$. Written this way, MDL selects the model that gives the highest value of the maximized likelihood per the relative ratio of distinguishable distributions ($V_M/C_M$). [sent-101, score-0.331]
64 From this perspective, a better model is simply one with many distinguishable distributions close to the truth, but few distinguishable distributions overall. [sent-102, score-0.541]
65 3 Application Examples Geometric complexity and MDL constitute a powerful pair of model evaluation tools. [sent-103, score-0.191]
66 When they are used together in model testing, a deeper understanding of the relationship between models can be gained. [sent-104, score-0.167]
67 The first measure enables one to assess the relative complexities of the set of models under consideration. [sent-105, score-0.135]
68 The second builds on the first by suggesting which model is preferable given the data in hand. [sent-106, score-0.103]
69 The following simulations demonstrate the application of these methods in two areas of cognitive modeling: information integration, and categorization. [sent-107, score-0.082]
70 In each example, two competing models were fitted to artificial data sets generated by each model. [sent-108, score-0.209]
71 Of interest is the ability of a selection method to recover the model that generated the data. [sent-109, score-0.27]
72 MDL is compared with two other selection methods, both of which consider the number of parameters only. [sent-110, score-0.159]
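The recovery design reduces to a short skeleton (an added sketch; the model interface `.simulate`, `.fit_ml`, `.k`, `.log_vol` is hypothetical scaffolding, not the authors' code): each criterion is computed from the maximized log likelihood, with $\mathrm{AIC} = -2\ln f(y|\hat{\theta}) + 2k$ and $\mathrm{BIC} = -2\ln f(y|\hat{\theta}) + k\ln N$, while MDL adds the geometric term as defined above; the model with the smallest criterion value counts as recovered when it matches the generator.

```python
import numpy as np

# Selection criteria, each a function of the negative maximized log likelihood.
# m.k is the number of parameters; m.log_vol is ln \int sqrt(det I) dtheta.
def aic(nll, m, N): return 2.0 * nll + 2.0 * m.k
def bic(nll, m, N): return 2.0 * nll + m.k * np.log(N)
def mdl(nll, m, N): return nll + 0.5 * m.k * np.log(N / (2 * np.pi)) + m.log_vol

def recovery_rate(gen_model, models, criterion, n_datasets=1000, N=20):
    """Fraction of data sets generated by gen_model for which `criterion`
    selects gen_model back; gen_model must be among `models`.  Each model
    object is assumed to provide .simulate(N) and .fit_ml(y), the latter
    returning the negative maximized log likelihood (hypothetical interface)."""
    hits = 0
    for _ in range(n_datasets):
        y = gen_model.simulate(N)
        scores = [criterion(m.fit_ml(y), m, N) for m in models]
        hits += models[int(np.argmin(scores))] is gen_model
    return hits / n_datasets
```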
73 3.1 Information Integration In a typical information integration experiment, a range of stimuli are generated from a factorial manipulation of two or more stimulus dimensions. [sent-115, score-0.179]
74 Data are scored as the proportion of responses in one category across the various combinations of stimulus dimensions. [sent-118, score-0.103]
75 When the data were generated by FLMP, FLMP was recovered 100% of the time. [sent-126, score-0.185]
76 This was true across all selection methods and across both sample sizes, except for MDL when sample size was 20. [sent-127, score-0.208]
77 In this case, MDL did not perform quite as well as the other selection methods. [sent-128, score-0.126]
78 When the data were generated by LIM, AIC or BIC fared much more poorly whereas MDL recovered the correct model (LIM) across both sample sizes. [sent-129, score-0.177]
79 This observation was confirmed when the geometric complexity of each model was calculated. [sent-133, score-0.424]
80 The difference in geometric complexity between FLMP and LIM was 8.74. [sent-134, score-0.324]
81 This means that for every distinguishable distribution for which LIM can account, FLMP can describe about $e^{8.74}$ distinguishable distributions. [sent-135, score-0.192]
82 Obviously, this difference in complexity between the two models must be due to the functional form because they have the same number of parameters. [sent-137, score-0.222]
83 3.2 Categorization Two models of categorization were considered in the present demonstration. [sent-139, score-0.103]
84 They were the generalized context model (GCM: [10]) and the prototype model (PRT: [11]). [sent-140, score-0.154]
85 Each model assumes that categorization responses follow a multinomial probability distribution with $p_{ij}$ (the probability of a category $C_j$ response given stimulus $X_i$), given by GCM: $p_{ij} = \sum_{k \in C_j} s_{ik} \,/\, \sum_{K} \sum_{k \in C_K} s_{ik}$; PRT: $p_{ij} = S_{ij} \,/\, \sum_{K} S_{iK}$. [sent-141, score-0.252]
86 In these equations, $s_{ij}$ is a similarity measure between multidimensional stimuli $X_i$ and $X_j$, and $S_{ij}$ is a similarity measure between stimulus $X_i$ and the prototypic stimulus $X_j$ of category $C_j$; similarity is measured using the Minkowski distance metric with metric parameter $r$. [sent-155, score-0.49]
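The two choice rules, as reconstructed above, can be sketched compactly (an added illustration; the exponential similarity function $s = e^{-c\,d}$ and the sensitivity parameter $c$ are standard choices in this literature, assumed here rather than taken from the paper).

```python
import numpy as np

def minkowski(x, y, r=2.0):
    """Minkowski distance with metric parameter r (r = 2 is Euclidean)."""
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

def choice_probs(stimulus, exemplars, prototypes, c=1.0, r=2.0):
    """Category-response probabilities for one stimulus under GCM and PRT.
    exemplars: category -> array of stored exemplar coordinates;
    prototypes: category -> one prototype coordinate vector.
    Similarity is s = exp(-c * d), a common assumption in this literature."""
    cats = sorted(exemplars)
    # GCM: summed similarity to every exemplar of each category, normalized.
    gcm = np.array([sum(np.exp(-c * minkowski(stimulus, e, r))
                        for e in exemplars[j]) for j in cats])
    # PRT: similarity to each category's prototype, normalized.
    prt = np.array([np.exp(-c * minkowski(stimulus, prototypes[j], r))
                    for j in cats])
    return gcm / gcm.sum(), prt / prt.sum()

exemplars = {"A": np.array([[0.0, 0.0], [0.0, 1.0]]),
             "B": np.array([[3.0, 3.0], [4.0, 3.0]])}
prototypes = {j: exemplars[j].mean(axis=0) for j in exemplars}
print(choice_probs(np.array([0.5, 0.5]), exemplars, prototypes))
```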
87 The two models were fitted to data sets generated by each model using the six-dimensional scaling solution from Experiment 1 of [12] under the Euclidean distance metric ($r = 2$). [sent-156, score-0.297]
88 As shown in Table 2, under AIC or BIC, a relatively small bias toward choosing GCM was found using data generated from PRT when N = 20. [sent-157, score-0.093]
89 In the larger sample size condition, there was no difference in model recovery rate between AIC and MDL. [sent-159, score-0.144]
90 This outcome contrasts with that of the preceding example, in which MDL was generally superior to the other selection methods when sample size was smallest. [sent-160, score-0.167]
91 The only circumstance in which such an outcome is predicted under MDL is when the functional forms of the two models are similar (recall that the models have the same number of parameters), thus minimizing the differential contribution of functional form in the complexity term. [sent-163, score-0.379]
92 Calculation of the geometric complexity of each model confirmed this suspicion. [sent-164, score-0.424]
93 These simulation results together demonstrate the usefulness of MDL and the geometric complexity measure in testing models of cognition. [sent-169, score-0.449]
94 MDL's sensitivity to functional form was clearly demonstrated in its superior model recovery rate, especially when the complexities of the models differed by a nontrivial amount. [sent-170, score-0.245]
95 4 Conclusion Model selection in cognitive science can proceed far more confidently with a clear understanding of why one model should be preferred over another. [sent-171, score-0.343]
96 A geometric interpretation of MDL helps to achieve this goal. [sent-172, score-0.249]
97 The work carried out thus far indicates that MDL, along with the geometric complexity measure, holds considerable promise in evaluating computational models of cognition. [sent-173, score-0.38]
98 MDL chooses the correct model most of the time, and geometric complexity provides a measure of how different the two models are in their capacity or power. [sent-174, score-0.502]
99 (1999) Counting probability distributions: Differential geometry and model selection. [sent-207, score-0.102]
100 Toward a method of selecting among computational models of cognition. [sent-249, score-0.103]
wordName wordTfidf (topN-words)
[('mdl', 0.685), ('flmp', 0.28), ('geometric', 0.21), ('gcm', 0.2), ('prt', 0.2), ('lim', 0.179), ('distinguishable', 0.155), ('selection', 0.126), ('aic', 0.12), ('complexity', 0.114), ('vm', 0.101), ('truth', 0.086), ('cognitive', 0.082), ('yle', 0.08), ('distributions', 0.077), ('model', 0.077), ('integration', 0.061), ('balasubramanian', 0.06), ('myung', 0.06), ('metric', 0.058), ('models', 0.056), ('family', 0.055), ('volume', 0.054), ('functional', 0.052), ('stimulus', 0.052), ('riemannian', 0.052), ('od', 0.052), ('pitt', 0.052), ('infinitesimal', 0.052), ('category', 0.051), ('psychology', 0.049), ('differential', 0.049), ('bic', 0.047), ('competing', 0.047), ('categorization', 0.047), ('fitted', 0.047), ('bayesian', 0.046), ('measure', 0.045), ('manifold', 0.045), ('regularity', 0.043), ('intuitive', 0.042), ('likelihood', 0.041), ('sample', 0.041), ('adequacy', 0.04), ('cyis', 0.04), ('ecovery', 0.04), ('nosofsky', 0.04), ('pjj', 0.04), ('sik', 0.04), ('vijay', 0.04), ('wee', 0.04), ('interpretation', 0.039), ('pr', 0.039), ('cm', 0.038), ('dv', 0.037), ('meaning', 0.037), ('fit', 0.035), ('fisher', 0.035), ('parameter', 0.035), ('ates', 0.034), ('eis', 0.034), ('akaike', 0.034), ('sij', 0.034), ('complexities', 0.034), ('pu', 0.034), ('ability', 0.034), ('understanding', 0.034), ('toward', 0.034), ('parameters', 0.033), ('generated', 0.033), ('dimensions', 0.033), ('regularities', 0.031), ('perspective', 0.03), ('similarity', 0.03), ('greatest', 0.029), ('capturing', 0.029), ('lij', 0.029), ('occupy', 0.029), ('third', 0.029), ('grant', 0.028), ('data', 0.026), ('description', 0.026), ('ej', 0.026), ('zhang', 0.026), ('recovery', 0.026), ('elegant', 0.026), ('probability', 0.025), ('length', 0.025), ('minimum', 0.024), ('clear', 0.024), ('cognition', 0.024), ('embedded', 0.024), ('testing', 0.024), ('among', 0.024), ('selecting', 0.023), ('confirmed', 0.023), ('term', 0.022), ('maximized', 0.022), ('pf', 0.022), ('wo', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
2 0.13914445 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
Author: Ilya Nemenman, William Bialek
Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory ('smoothness scale') can be determined self-consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space (the 'Occam factor') that discriminates against models with more parameters [1, 2]. This works remarkably well for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if $N$ i.i.d. samples $\{x_i\}$, $i = 1 \ldots N$, are observed, then the probability that a particular density $Q(x)$ gave rise to these data is given by $P[Q(x)|\{x_i\}] = P[Q(x)] \prod_{i=1}^N Q(x_i) \,/\, \int [dQ(x)]\, P[Q(x)] \prod_{i=1}^N Q(x_i)$ (1), where $P[Q(x)]$ encodes our a priori expectations of $Q$. Specifying this prior on a space of functions defines a QFT, and the optimal least square estimator is then $Q_{est}(x|\{x_i\}) = \langle Q(x)\, Q(x_1) Q(x_2) \cdots Q(x_N)\rangle^{(0)} \,/\, \langle Q(x_1) Q(x_2) \cdots Q(x_N)\rangle^{(0)}$ (2), where $\langle \cdots \rangle^{(0)}$ means averaging with respect to the prior. Since $Q(x) \ge 0$, it is convenient to define an unconstrained field $\phi(x)$, $Q(x) = (1/\ell_0)\exp[-\phi(x)]$. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning.
We want the prior $P[\phi]$ to make sense as a continuous theory, independent of discretization of $x$ on small scales. We also require that when we estimate the distribution $Q(x)$ the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. For $x$ in one dimension, a minimal choice is $P[\phi(x)] = \frac{1}{Z}\exp\left[-\frac{\ell^{2\eta-1}}{2}\int dx\left(\frac{\partial^\eta \phi}{\partial x^\eta}\right)^2\right]\delta\left[\frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1\right]$ (3), where $\eta > 1/2$, $Z$ is the normalization constant, and the $\delta$-function enforces normalization of $Q$. We refer to $\ell$ and $\eta$ as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large $N$ and $\eta = 1$, expressing $\langle\prod_{i=1}^N Q(x_i)\rangle^{(0)}$ through an effective action $S_{eff}$ evaluated at the 'classical' (maximum likelihood, saddle point) solution $\phi_{cl}(x)$ [Eqs. (4)-(6)]. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite $N$, that the mean value of $\phi_{cl}$ converges to the negative logarithm of the target distribution $P(x)$ very quickly, and that the variance of fluctuations $\psi(x) = \phi(x) - [-\log \ell_0 P(x)]$ falls off as $\sim 1/\sqrt{\ell N P(x)}$. Finally, it was speculated that if the actual $\ell$ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale $\ell^*$. At first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. In the semi-classical limit of large $N$, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine $x$ to a box $0 \le x \le L$ with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always $10^{-5}$). The independent variable $x \in [0,1]$ is discretized in equal steps [$10^4$ for Figs. (1.a-2.b), and $10^5$ for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of $\phi_{cl}(x)$ is given by Eq. (7) of [5].
3 0.11801723 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
Author: Shimon Edelman, Nathan Intrator
Abstract: We describe a unified framework for the understanding of structure representation in primate vision. A model derived from this framework is shown to be effectively systematic in that it has the ability to interpret and associate together objects that are related through a rearrangement of common
4 0.09008912 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
Author: Igor V. Cadez, Padhraic Smyth
Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within first order as a function of model complexity. This general property of
5 0.082161747 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
Abstract: The Bayesian paradigm apparently only sometimes gives rise to Occam's Razor; at other times very large models perform well. We give simple examples of both kinds of behaviour. The two views are reconciled when measuring complexity of functions, rather than of the machinery used to implement them. We analyze the complexity of functions for some linear in the parameter models that are equivalent to Gaussian Processes, and always find Occam's Razor at work.
6 0.073730126 27 nips-2000-Automatic Choice of Dimensionality for PCA
7 0.064053096 54 nips-2000-Feature Selection for SVMs
8 0.060750261 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
9 0.054653943 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
10 0.047979943 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models
11 0.047947612 148 nips-2000-`N-Body' Problems in Statistical Learning
12 0.044532154 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code
13 0.04441078 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
14 0.044386145 146 nips-2000-What Can a Single Neuron Compute?
15 0.041738652 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
16 0.041528959 23 nips-2000-An Adaptive Metric Machine for Pattern Classification
17 0.041456822 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
18 0.04130194 133 nips-2000-The Kernel Gibbs Sampler
19 0.041292109 141 nips-2000-Universality and Individuality in a Neural Code
20 0.03988532 114 nips-2000-Second Order Approximations for Probability Models
topicId topicWeight
[(0, 0.158), (1, -0.018), (2, 0.011), (3, 0.001), (4, 0.065), (5, 0.034), (6, 0.016), (7, 0.045), (8, 0.048), (9, -0.097), (10, -0.022), (11, 0.171), (12, 0.102), (13, 0.02), (14, 0.067), (15, 0.111), (16, 0.014), (17, -0.224), (18, -0.162), (19, 0.064), (20, -0.039), (21, 0.053), (22, -0.225), (23, -0.083), (24, 0.067), (25, -0.029), (26, 0.106), (27, -0.039), (28, -0.037), (29, -0.024), (30, -0.05), (31, 0.091), (32, -0.044), (33, 0.084), (34, 0.012), (35, -0.259), (36, 0.065), (37, -0.008), (38, 0.033), (39, -0.001), (40, -0.111), (41, 0.011), (42, 0.053), (43, 0.081), (44, 0.012), (45, -0.043), (46, -0.005), (47, -0.168), (48, 0.122), (49, 0.136)]
simIndex simValue paperId paperTitle
same-paper 1 0.94173163 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
2 0.55807739 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
Author: Ilya Nemenman, William Bialek
3 0.52876735 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
Author: Igor V. Cadez, Padhraic Smyth
4 0.49847692 92 nips-2000-Occam's Razor
Author: Carl Edward Rasmussen, Zoubin Ghahramani
5 0.45418915 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
Author: Shimon Edelman, Nathan Intrator
6 0.43200713 27 nips-2000-Automatic Choice of Dimensionality for PCA
7 0.42383716 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
8 0.35712373 132 nips-2000-The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving
9 0.32867911 44 nips-2000-Efficient Learning of Linear Perceptrons
10 0.29343602 127 nips-2000-Structure Learning in Human Causal Induction
11 0.29240081 148 nips-2000-`N-Body' Problems in Statistical Learning
13 0.28591785 40 nips-2000-Dendritic Compartmentalization Could Underlie Competition and Attentional Biasing of Simultaneous Visual Stimuli
14 0.28562269 54 nips-2000-Feature Selection for SVMs
15 0.27141753 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
16 0.27120167 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing
17 0.2441431 138 nips-2000-The Use of Classifiers in Sequential Inference
18 0.21980876 23 nips-2000-An Adaptive Metric Machine for Pattern Classification
19 0.21358362 42 nips-2000-Divisive and Subtractive Mask Effects: Linking Psychophysics and Biophysics
20 0.20559174 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
topicId topicWeight
[(10, 0.031), (17, 0.116), (32, 0.025), (33, 0.042), (54, 0.011), (55, 0.024), (56, 0.322), (62, 0.051), (65, 0.018), (67, 0.056), (72, 0.018), (76, 0.062), (79, 0.04), (81, 0.035), (90, 0.02), (97, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.78749269 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling.

1.1 Minimum Description Length

The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters θ assigns the likelihood f(y|θ) to a given set of observed data y. The full form of the MDL measure for such a model family is given below:

\mathrm{MDL} = -\ln f(y\,|\,\hat{\theta}) + \frac{k}{2}\ln\frac{N}{2\pi} + \ln\int d\theta\,\sqrt{\det I(\theta)},

where \hat{\theta} is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size, and I(θ) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log-likelihood term is the lack-of-fit measure; the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(θ). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper.

MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data, for a special form of the parameter prior density [2]. Hence, in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as of Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection.

2 Differential Geometric Interpretation of MDL

From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which
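As a worked instance of the criterion above, the sketch below (ours, not from the paper; the helper name mdl_bernoulli and the choice of a one-parameter Bernoulli model are assumptions for illustration) evaluates the three MDL terms. For this model the Fisher information is I(θ) = 1/[θ(1 − θ)], so the functional-form integral ∫₀¹ √det I(θ) dθ equals π.

```python
import numpy as np
from scipy.integrate import quad

def mdl_bernoulli(y):
    """MDL = -ln f(y|theta_hat) + (k/2) ln(N / 2*pi)
             + ln Integral sqrt(det I(theta)) dtheta
    for a Bernoulli(theta) model of binary data y; a toy sketch."""
    y = np.asarray(y, dtype=float)
    N, k = len(y), 1
    theta_hat = y.mean()                   # maximum-likelihood estimate
    eps = 1e-12                            # guard against log(0)
    neg_log_lik = -np.sum(y * np.log(theta_hat + eps)
                          + (1 - y) * np.log(1 - theta_hat + eps))
    # I(theta) = 1/(theta*(1-theta)); its Riemannian volume over [0, 1]
    # is pi, computed here numerically as a check of the closed form.
    volume, _ = quad(lambda t: (t * (1.0 - t)) ** -0.5, 0.0, 1.0)
    return neg_log_lik + 0.5 * k * np.log(N / (2.0 * np.pi)) + np.log(volume)
```

Note that for fixed N the last two terms do not depend on the observed data at all: geometric complexity is a property of the model family itself, which is what lets MDL penalize functional form and not merely the parameter count k.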
2 0.7300995 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller
Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm, yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.
3 0.4511005 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
Author: Zoubin Ghahramani, Matthew J. Beal
Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1
4 0.44555986 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method combines a Bayesian online algorithm with a sequential construction of a relevant subsample of the data that fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
5 0.43596628 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
Author: Thomas Natschläger, Wolfgang Maass, Eduardo D. Sontag, Anthony M. Zador
Abstract: Experimental data show that biological synapses behave quite differently from the symbolic synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their
6 0.43342814 37 nips-2000-Convergence of Large Margin Separable Linear Classification
7 0.43302372 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
8 0.43271267 146 nips-2000-What Can a Single Neuron Compute?
9 0.43187299 74 nips-2000-Kernel Expansions with Unlabeled Examples
10 0.43107012 60 nips-2000-Gaussianization
11 0.42958376 94 nips-2000-On Reversing Jensen's Inequality
12 0.42884761 92 nips-2000-Occam's Razor
13 0.42809644 133 nips-2000-The Kernel Gibbs Sampler
14 0.42632779 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
15 0.42464596 4 nips-2000-A Linear Programming Approach to Novelty Detection
16 0.42248073 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
17 0.42241865 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
18 0.42239848 79 nips-2000-Learning Segmentation by Random Walks
19 0.42183477 49 nips-2000-Explaining Away in Weight Space
20 0.42017558 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data