
1431 andrew gelman stats-2012-07-27-Overfitting


meta info for this blog

Source: html

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different
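
The cross-validation point above can be made concrete with a small simulation. This is a minimal sketch, not code from the post: the polynomial-regression setup, sample size, noise level, and the rmse helper are assumptions chosen only to show how in-sample error stays optimistic while held-out (cross-validated) error exposes overfitting.

```python
# Minimal sketch (illustrative assumptions, not from the post): compare
# in-sample fit with 5-fold cross-validated error as model complexity grows.
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.4, n)  # true curve plus noise

def rmse(coef, x_ev, y_ev):
    # Root-mean-squared error of a fitted polynomial on evaluation data.
    return np.sqrt(np.mean((np.polyval(coef, x_ev) - y_ev) ** 2))

folds = np.array_split(rng.permutation(n), 5)

for degree in (1, 3, 5, 9, 12):
    in_sample = rmse(np.polyfit(x, y, degree), x, y)
    cv = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        cv.append(rmse(np.polyfit(x[train], y[train], degree),
                       x[held_out], y[held_out]))
    print(f"degree {degree:2d}   in-sample RMSE {in_sample:.3f}   "
          f"CV RMSE {np.mean(cv):.3f}")
```

In-sample error keeps shrinking as the degree grows, while the cross-validated error stops improving and then degrades; that gap is the “overly optimistic estimates” problem in the question, and the same caveat applies when the thing being tuned is a prior or a model structure rather than a polynomial degree.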


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc. [sent-1, score-0.914]

2 , all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. [sent-2, score-0.911]

3 You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. [sent-3, score-0.477]

4 While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. [sent-4, score-1.157]

5 When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? [sent-5, score-1.039]

6 If you try several different priors/hyperpriors with your model, then simply choose the values and distributions that produce the best “fit” for your data, you are probably just choosing the most optimistic, rather than the most accurate model. [sent-6, score-0.24]

7 My reply: I have no magic answer but in general I prefer continuous model expansion instead of discrete model choice or averaging. [sent-7, score-1.139]

8 Also, I think many of the worst problems with overfitting arise when using least squares or other flat-prior methods. [sent-8, score-0.147]

9 More generally, this is a big open problem in statistics and AI: how to make one’s model general enough to let the data speak but structured enough to hear what the data have to say. [sent-10, score-0.982]

10 To which Esteban writes: I agree with the idea of continuous model expansion. [sent-11, score-0.457]

11 However, it still leaves open questions such as “do I give the variance prior a uniform, a half-normal, or a half-Cauchy distribution”. [sent-12, score-0.41]

12 These questions have to be answered discretely, rather than using continuous expansion, which in turn calls for an objective criterion for selecting among the alternate model formulations. [sent-13, score-1.419]

13 My reply: In theory, this can still be done using model expansion (for example, the half-t family includes uniform, half-normal, and half-Cauchy as special cases) but in practice, yes, choices must be made. [sent-14, score-0.666]

14 But I don’t know that I need an objective criterion for choosing; I think it could be enough to have objective measures of prediction error and to use these along with general understanding to pick a model. [sent-15, score-0.663]
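
A minimal sketch of the half-t point in the reply above, assuming NumPy and SciPy are available (the code, the half_t_pdf helper, and the particular scale value are illustrative, not from the post): a half-t(ν, scale) density is a Student-t density folded at zero, and moving ν and the scale continuously recovers the alternatives that would otherwise be a discrete choice.

```python
# Minimal sketch (illustrative, not from the post): the half-t family nests
# the half-Cauchy (nu = 1), the half-normal (nu -> infinity), and, as the
# scale grows, an essentially flat (uniform) prior on a scale parameter.
import numpy as np
from scipy import stats

x = np.linspace(0.01, 20.0, 500)
scale = 5.0

def half_t_pdf(x, nu, scale):
    # Density of |T| with T ~ Student-t(nu), location 0 and the given scale.
    return 2.0 * stats.t.pdf(x, df=nu, loc=0.0, scale=scale)

# nu = 1 is exactly the half-Cauchy prior.
print(np.max(np.abs(half_t_pdf(x, 1, scale)
                    - stats.halfcauchy.pdf(x, scale=scale))))  # essentially 0

# Large nu is effectively the half-normal prior.
print(np.max(np.abs(half_t_pdf(x, 1e6, scale)
                    - stats.halfnorm.pdf(x, scale=scale))))    # tiny

# A huge scale is nearly flat over any plausible range, so a uniform prior
# appears as a limiting case of the same family.
print(half_t_pdf(np.array([0.5, 5.0, 50.0]), nu=3, scale=1e6))
```

So in principle the choice among uniform, half-normal, and half-Cauchy can be folded into one continuous family and handled by model expansion; in practice, as the reply notes, one still has to pick ν and the scale, ideally guided by out-of-sample measures of prediction error together with general understanding rather than by in-sample fit alone.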


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('model', 0.279), ('esteban', 0.236), ('expansion', 0.232), ('objective', 0.202), ('alternate', 0.194), ('optimistic', 0.186), ('overfitting', 0.182), ('continuous', 0.178), ('selecting', 0.173), ('structured', 0.153), ('input', 0.144), ('choosing', 0.142), ('uniform', 0.14), ('techniques', 0.129), ('features', 0.121), ('fiddling', 0.118), ('discretely', 0.118), ('manually', 0.118), ('ilya', 0.111), ('priors', 0.11), ('learning', 0.1), ('forcing', 0.1), ('values', 0.098), ('data', 0.096), ('ai', 0.093), ('trap', 0.093), ('open', 0.093), ('general', 0.091), ('validation', 0.09), ('models', 0.089), ('overly', 0.088), ('enough', 0.087), ('answered', 0.087), ('falling', 0.085), ('damage', 0.084), ('imperfect', 0.083), ('structures', 0.081), ('criterion', 0.081), ('still', 0.081), ('among', 0.081), ('magic', 0.08), ('prior', 0.08), ('producing', 0.08), ('questions', 0.079), ('solutions', 0.079), ('leaves', 0.077), ('weights', 0.074), ('using', 0.074), ('squares', 0.073), ('criteria', 0.072)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

2 0.2243019 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

Introduction: I’ve been writing a lot about my philosophy of Bayesian statistics and how it fits into Popper’s ideas about falsification and Kuhn’s ideas about scientific revolutions. Here’s my long, somewhat technical paper with Cosma Shalizi. Here’s our shorter overview for the volume on the philosophy of social science. Here’s my latest try (for an online symposium), focusing on the key issues. I’m pretty happy with my approach–the familiar idea that Bayesian data analysis iterates the three steps of model building, inference, and model checking–but it does have some unresolved (maybe unresolvable) problems. Here are a couple mentioned in the third of the above links. Consider a simple model with independent data y_1, y_2, .., y_10 ~ N(θ,σ^2), with a prior distribution θ ~ N(0,10^2) and σ known and taking on some value of approximately 10. Inference about μ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewn

3 0.21260214 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

Introduction: Konrad Scheffler writes: I was interested by your paper “Induction and deduction in Bayesian data analysis” and was wondering if you would entertain a few questions: – Under the banner of objective Bayesianism, I would posit something like this as a description of Bayesian inference: “Objective Bayesian probability is not a degree of belief (which would necessarily be subjective) but a measure of the plausibility of a hypothesis, conditional on a formally specified information state. One way of specifying a formal information state is to specify a model, which involves specifying both a prior distribution (typically for a set of unobserved variables) and a likelihood function (typically for a set of observed variables, conditioned on the values of the unobserved variables). Bayesian inference involves calculating the objective degree of plausibility of a hypothesis (typically the truth value of the hypothesis is a function of the variables mentioned above) given such a

4 0.19587457 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

Introduction: Nick Firoozye writes: I had a question about BMA [Bayesian model averaging] and model combinations in general, and direct it to you since they are a basic form of hierarchical model, albeit in the simplest of forms. I wanted to ask what the underlying assumptions are that could lead to BMA improving on a larger model. I know model combination is a topic of interest in the (frequentist) econometrics community (e.g., Bates & Granger, http://www.jstor.org/discover/10.2307/3008764?uid=3738032&uid=2&uid=4&sid=21101948653381) but at the time it was considered a bit of a puzzle. Perhaps small models combined outperform a big model due to standard errors, insufficient data, etc. But I haven’t seen much in the way of Bayesian justification. In simplest terms, you might have a joint density P(Y,theta_1,theta_2) from which you could use the two marginals P(Y,theta_1) and P(Y,theta_2) to derive two separate forecasts. A BMA-er would do a weighted average of the two forecast densities, having p

5 0.19556253 1510 andrew gelman stats-2012-09-25-Incoherence of Bayesian data analysis

Introduction: Hogg writes: At the end this article you wonder about consistency. Have you ever considered the possibility that utility might resolve some of the problems? I have no idea if it would—I am not advocating that position—I just get some kind of intuition from phrases like “Judgment is required to decide…”. Perhaps there is a coherent and objective description of what is—or could be—done under a coherent “utility” model (like a utility that could be objectively agreed upon and computed). Utilities are usually subjective—true—but priors are usually subjective too. My reply: I’m happy to think about utility, for some particular problem or class of problems going to the effort of assigning costs and benefits to different outcomes. I agree that a utility analysis, even if (necessarily) imperfect, can usefully focus discussion. For example, if a statistical method for selecting variables is justified on the basis of cost, I like the idea of attempting to quantify the costs of ga

6 0.19022597 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

7 0.18846492 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

8 0.17664891 846 andrew gelman stats-2011-08-09-Default priors update?

9 0.17493129 1392 andrew gelman stats-2012-06-26-Occam

10 0.1712908 774 andrew gelman stats-2011-06-20-The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing

11 0.16657914 1228 andrew gelman stats-2012-03-25-Continuous variables in Bayesian networks

12 0.16349781 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning

13 0.16349091 811 andrew gelman stats-2011-07-20-Kind of Bayesian

14 0.15852273 653 andrew gelman stats-2011-04-08-Multilevel regression with shrinkage for “fixed” effects

15 0.1559419 1465 andrew gelman stats-2012-08-21-D. Buggin

16 0.15112162 1092 andrew gelman stats-2011-12-29-More by Berger and me on weakly informative priors

17 0.15100577 2133 andrew gelman stats-2013-12-13-Flexibility is good

18 0.14723963 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

19 0.1470789 1941 andrew gelman stats-2013-07-16-Priors

20 0.14547326 2129 andrew gelman stats-2013-12-10-Cross-validation and Bayesian estimation of tuning parameters


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.24), (1, 0.257), (2, 0.003), (3, 0.079), (4, 0.006), (5, 0.025), (6, 0.014), (7, -0.015), (8, 0.052), (9, 0.098), (10, 0.037), (11, 0.067), (12, -0.052), (13, 0.022), (14, -0.074), (15, -0.016), (16, 0.041), (17, -0.051), (18, 0.004), (19, 0.013), (20, -0.019), (21, -0.048), (22, -0.037), (23, -0.044), (24, -0.069), (25, 0.021), (26, -0.01), (27, -0.066), (28, 0.014), (29, -0.0), (30, -0.011), (31, -0.025), (32, -0.004), (33, 0.01), (34, 0.0), (35, 0.038), (36, -0.017), (37, 0.01), (38, 0.001), (39, 0.004), (40, -0.005), (41, -0.004), (42, -0.002), (43, 0.068), (44, -0.036), (45, 0.025), (46, 0.027), (47, -0.023), (48, 0.028), (49, 0.032)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98073459 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

2 0.9337582 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

Introduction: Nick Firoozye writes: I had a question about BMA [Bayesian model averaging] and model combinations in general, and direct it to you since they are a basic form of hierarchical model, albeit in the simplest of forms. I wanted to ask what the underlying assumptions are that could lead to BMA improving on a larger model. I know model combination is a topic of interest in the (frequentist) econometrics community (e.g., Bates & Granger, http://www.jstor.org/discover/10.2307/3008764?uid=3738032&uid=2&uid=4&sid=21101948653381) but at the time it was considered a bit of a puzzle. Perhaps small models combined outperform a big model due to standard errors, insufficient data, etc. But I haven’t seen much in the way of Bayesian justification. In simplest terms, you might have a joint density P(Y,theta_1,theta_2) from which you could use the two marginals P(Y,theta_1) and P(Y,theta_2) to derive two separate forecasts. A BMA-er would do a weighted average of the two forecast densities, having p

3 0.90745473 1510 andrew gelman stats-2012-09-25-Incoherence of Bayesian data analysis

Introduction: Hogg writes: At the end this article you wonder about consistency. Have you ever considered the possibility that utility might resolve some of the problems? I have no idea if it would—I am not advocating that position—I just get some kind of intuition from phrases like “Judgment is required to decide…”. Perhaps there is a coherent and objective description of what is—or could be—done under a coherent “utility” model (like a utility that could be objectively agreed upon and computed). Utilities are usually subjective—true—but priors are usually subjective too. My reply: I’m happy to think about utility, for some particular problem or class of problems going to the effort of assigning costs and benefits to different outcomes. I agree that a utility analysis, even if (necessarily) imperfect, can usefully focus discussion. For example, if a statistical method for selecting variables is justified on the basis of cost, I like the idea of attempting to quantify the costs of ga

4 0.90631586 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

Introduction: I’ve been writing a lot about my philosophy of Bayesian statistics and how it fits into Popper’s ideas about falsification and Kuhn’s ideas about scientific revolutions. Here’s my long, somewhat technical paper with Cosma Shalizi. Here’s our shorter overview for the volume on the philosophy of social science. Here’s my latest try (for an online symposium), focusing on the key issues. I’m pretty happy with my approach–the familiar idea that Bayesian data analysis iterates the three steps of model building, inference, and model checking–but it does have some unresolved (maybe unresolvable) problems. Here are a couple mentioned in the third of the above links. Consider a simple model with independent data y_1, y_2, .., y_10 ~ N(θ,σ^2), with a prior distribution θ ~ N(0,10^2) and σ known and taking on some value of approximately 10. Inference about μ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewn

5 0.89253008 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

Introduction: In response to this article by Cosma Shalizi and myself on the philosophy of Bayesian statistics, David Hogg writes: I [Hogg] agree–even in physics and astronomy–that the models are not “True” in the God-like sense of being absolute reality (that is, I am not a realist); and I have argued (a philosophically very naive paper, but hey, I was new to all this) that for pretty fundamental reasons we could never arrive at the True (with a capital “T”) model of the Universe. The goal of inference is to find the “best” model, where “best” might have something to do with prediction, or explanation, or message length, or (horror!) our utility. Needless to say, most of my physics friends *are* realists, even in the face of “effective theories” as Newtonian mechanics is an effective theory of GR and GR is an effective theory of “quantum gravity” (this plays to your point, because if you think any theory is possibly an effective theory, how could you ever find Truth?). I also liked the i

6 0.89061654 1459 andrew gelman stats-2012-08-15-How I think about mixture models

7 0.89021146 1392 andrew gelman stats-2012-06-26-Occam

8 0.88633186 1141 andrew gelman stats-2012-01-28-Using predator-prey models on the Canadian lynx series

9 0.88521576 1216 andrew gelman stats-2012-03-17-Modeling group-level predictors in a multilevel regression

10 0.88378704 1406 andrew gelman stats-2012-07-05-Xiao-Li Meng and Xianchao Xie rethink asymptotics

11 0.86519611 1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?

12 0.85925192 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

13 0.85801148 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

14 0.85775083 773 andrew gelman stats-2011-06-18-Should we always be using the t and robit instead of the normal and logit?

15 0.85666031 1041 andrew gelman stats-2011-12-04-David MacKay and Occam’s Razor

16 0.85363448 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

17 0.84972441 320 andrew gelman stats-2010-10-05-Does posterior predictive model checking fit with the operational subjective approach?

18 0.84319621 1817 andrew gelman stats-2013-04-21-More on Bayesian model selection in high-dimensional settings

19 0.83499807 1047 andrew gelman stats-2011-12-08-I Am Too Absolutely Heteroskedastic for This Probit Model

20 0.83331811 774 andrew gelman stats-2011-06-20-The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(16, 0.031), (21, 0.029), (24, 0.089), (76, 0.014), (86, 0.021), (99, 0.702)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9994337 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

2 0.99924952 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings

Introduction: In a link to our back-and-forth on causal inference and the use of hierarchical models to bridge between different inferential settings, Elias Bareinboim (a computer scientist who is working with Judea Pearl) writes : In the past week, I have been engaged in a discussion with Andrew Gelman and his blog readers regarding causal inference, selection bias, confounding, and generalizability. I was trying to understand how his method which he calls “hierarchical modeling” would handle these issues and what guarantees it provides. . . . If anyone understands how “hierarchical modeling” can solve a simple toy problem (e.g., M-bias, control of confounding, mediation, generalizability), please share with us. In his post, Bareinboim raises a direct question about hierarchical modeling and also indirectly brings up larger questions about what is convincing evidence when evaluating a statistical method. As I wrote earlier, Bareinboim believes that “The only way investigators can decide w

3 0.99894053 521 andrew gelman stats-2011-01-17-“the Tea Party’s ire, directed at Democrats and Republicans alike”

Introduction: Mark Lilla recalls some recent Barack Obama quotes and then writes : If this is the way the president and his party think about human psychology, it’s little wonder they’ve taken such a beating. In the spirit of that old line, “That and $4.95 will get you a tall latte,” let me agree with Lilla and attribute the Democrats’ losses in 2010 to the following three factors: 1. A poor understanding of human psychology; 2. The Democrats holding unified control of the presidency and congress with a large majority in both houses (factors that are historically associated with big midterm losses); and 3. A terrible economy. I will let you, the readers, make your best guesses as to the relative importance of factors 1, 2, and 3 above. Don’t get me wrong: I think psychology is important, as is the history of ideas (the main subject of Lilla’s article), and I’d hope that Obama (and also his colleagues in both parties in congress) can become better acquainted with psychology, moti

4 0.99861437 809 andrew gelman stats-2011-07-19-“One of the easiest ways to differentiate an economist from almost anyone else in society”

Introduction: I think I’m starting to resolve a puzzle that’s been bugging me for awhile. Pop economists (or, at least, pop micro-economists) are often making one of two arguments: 1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist. 2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient. Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior. Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-schools teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficienc

5 0.99835753 638 andrew gelman stats-2011-03-30-More on the correlation between statistical and political ideology

Introduction: This is a chance for me to combine two of my interests–politics and statistics–and probably to irritate both halves of the readership of this blog. Anyway… I recently wrote about the apparent correlation between Bayes/non-Bayes statistical ideology and liberal/conservative political ideology: The Bayes/non-Bayes fissure had a bit of a political dimension–with anti-Bayesians being the old-line conservatives (for example, Ronald Fisher) and Bayesians having a more of a left-wing flavor (for example, Dennis Lindley). Lots of counterexamples at an individual level, but my impression is that on average the old curmudgeonly, get-off-my-lawn types were (with some notable exceptions) more likely to be anti-Bayesian. This was somewhat based on my experiences at Berkeley. Actually, some of the cranky anti-Bayesians were probably Democrats as well, but when they were being anti-Bayesian they seemed pretty conservative. Recently I received an interesting item from Gerald Cliff, a pro

6 0.99834394 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points

7 0.99813539 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

8 0.99757481 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models

9 0.99750251 1315 andrew gelman stats-2012-05-12-Question 2 of my final exam for Design and Analysis of Sample Surveys

10 0.99742627 1434 andrew gelman stats-2012-07-29-FindTheData.org

11 0.99707997 1952 andrew gelman stats-2013-07-23-Christakis response to my comment on his comments on social science (or just skip to the P.P.P.S. at the end)

12 0.99698156 1585 andrew gelman stats-2012-11-20-“I know you aren’t the plagiarism police, but . . .”

13 0.9968304 507 andrew gelman stats-2011-01-07-Small world: MIT, asymptotic behavior of differential-difference equations, Susan Assmann, subgroup analysis, multilevel modeling

14 0.99678445 740 andrew gelman stats-2011-06-01-The “cushy life” of a University of Illinois sociology professor

15 0.99614716 756 andrew gelman stats-2011-06-10-Christakis-Fowler update

16 0.99589527 1096 andrew gelman stats-2012-01-02-Graphical communication for legal scholarship

17 0.99588543 180 andrew gelman stats-2010-08-03-Climate Change News

18 0.99525553 1670 andrew gelman stats-2013-01-13-More Bell Labs happy talk

19 0.99508041 1813 andrew gelman stats-2013-04-19-Grad students: Participate in an online survey on statistics education

20 0.99493414 2256 andrew gelman stats-2014-03-20-Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .