andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1999 knowledge-graph by maker-knowledge-mining

1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model


meta info for this blog

Source: html

Introduction: Nick Firoozye writes: I had a question about BMA [Bayesian model averaging] and model combinations in general, and direct it to you since they are a basic form of hierarchical model, albeit in the simplest of forms. I wanted to ask what the underlying assumptions are that could lead to BMA improving on a larger model. I know model combination is a topic of interest in the (frequentist) econometrics community (e.g., Bates & Granger, http://www.jstor.org/discover/10.2307/3008764?uid=3738032&uid=2&uid=4&sid=21101948653381) but at the time it was considered a bit of a puzzle. Perhaps small models combined outperform a big model due to standard errors, insufficient data, etc. But I haven’t seen much in the way of Bayesian justification. In simplest terms, you might have a joint density P(Y,theta_1,theta_2) from which you could use the two marginals P(Y,theta_1) and P(Y,theta_2) to derive two separate forecasts. A BMA-er would do a weighted average of the two forecast densities, having previously had a model prior.
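
To make the forecast-combination question concrete, here is a minimal numerical sketch in Python. The Gaussian forecast densities and the posterior model weights below are hypothetical illustrations, not values from the post:

    import numpy as np
    from scipy.stats import norm

    # Two submodel forecast densities p(y|M1) and p(y|M2). Gaussians are an
    # assumption for illustration; in practice these would come from the two
    # marginal models P(Y, theta_1) and P(Y, theta_2).
    forecast_1 = norm(loc=0.5, scale=1.0)
    forecast_2 = norm(loc=1.5, scale=2.0)

    # Hypothetical posterior model probabilities, i.e., a model prior
    # updated by the marginal likelihoods.
    w1, w2 = 0.7, 0.3

    # The BMA forecast density is the weighted mixture of the two densities,
    # and the BMA point forecast is the same weighted average of the means.
    y = np.linspace(-5.0, 8.0, 200)
    bma_density = w1 * forecast_1.pdf(y) + w2 * forecast_2.pdf(y)
    bma_mean = w1 * forecast_1.mean() + w2 * forecast_2.mean()
    print(f"BMA point forecast: {bma_mean:.2f}")

The puzzle Firoozye raises is when this mixture should beat fitting the joint model P(Y,theta_1,theta_2) directly.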


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Nick Firoozye writes: I had a question about BMA [Bayesian model averaging] and model combinations in general, and direct it to you since they are a basic form of hierarchical model, albeit in the simplest of forms. [sent-1, score-0.983]

2 I know model combination is a topic of interest in the (frequentist) econometrics community (e.g., Bates & Granger). [sent-3, score-0.356]

3 Perhaps small models combined outperform a big model due to standard errors, insufficient data, etc. [sent-10, score-0.58]

4 In simplest terms, you might have a joint density P(Y,theta_1,theta_2) from which you could use the two marginals P(Y,theta_1) and P(Y,theta_2) to derive two separate forecasts. [sent-12, score-0.435]

5 A BMA-er would do a weighted average of the two forecast densities, having previously had a model prior. [sent-13, score-0.456]

6 Is it an inability to impose proper priors on the larger parameter space? [sent-21, score-0.322]

7 Of course I’m thinking in terms of simple easily combined models (e.g., [sent-24, score-0.313]

8 a regression on two variables), and a BMA-er could easily combine far more challenging models that don’t naturally form a supermodel. [sent-26, score-0.371]

9 My reply: Conditional on being required to use noninformative priors on each submodel, the strategy of model averaging or model selection can be better than using the larger model. [sent-27, score-1.064]

10 But I agree that, if you’re thinking of fitting the small model or the large model, it makes more sense to use an informative prior that allows for shrinkage directly. [sent-28, score-0.564]

11 I believe some of the early examples of BMA involved running 2^k regressions and averaging rather than running a single k-dimensional regression with shrinkage (so much simpler doing a lasso,… er…, shrinkage estimator, to be honest). [sent-32, score-0.661] (See the first sketch after this list.)

12 And the BMA was meant to be preferable to frequentist sequential model selection or using a criterion which involves the 2^k regressions anyway. [sent-33, score-0.6]

13 And if so, aren’t you saying it is better to have a single large non-hierarchical model than a hierarchical model? [sent-35, score-0.492]

14 If we had informative priors on both parameters (and hyperparameters), we might have a better model yet? [sent-37, score-0.477]

15 If so, why should we be doing hierarchical models at all, other than they can be far more intuitive than super-models? [sent-38, score-0.475]

16 While Dempster and Shafer do extend Bayes’ theorem to their special case, it just seems their theory is so much more complex than just using a hierarchical model. [sent-40, score-0.368]

17 Second order probability is merely a hierarchical model, with weights being put on a family of probability measures, and so much more intuitive than belief functions. [sent-41, score-0.597] (See the second sketch after this list.)

18 My response: Indeed, discrete model averaging can be seen as a sort of implementation of continuous model expansion in which the probability of setting a coefficient to zero is a way to get some shrinkage. [sent-42, score-0.995] (See the first sketch after this list.)

19 For similar reasons, I have no particular interest in seeing which sets of predictors the fitted model wants me to include. [sent-44, score-0.356]

20 On your larger question: yes, I think hierarchical priors can be useful in specifying dependent uncertainty. [sent-45, score-0.467]
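
As a concrete illustration of sentences 11 and 18, here is a hedged sketch in Python of averaging over the 2^k regressions versus a single k-dimensional fit with shrinkage. The simulated data, the exp(-BIC/2) approximation to posterior model weights (under an assumed uniform model prior), and the ridge penalty standing in for an informative shrinkage prior are all assumptions for illustration, not the post’s method:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated regression with k candidate predictors, only two active.
    n, k = 100, 4
    X = rng.normal(size=(n, k))
    beta_true = np.array([2.0, 0.0, -1.0, 0.0])
    y = X @ beta_true + rng.normal(size=n)

    # Enumerate all 2^k predictor subsets, fit each by least squares, and
    # weight the fits by exp(-BIC/2), a common large-sample approximation
    # to posterior model probabilities under a uniform model prior.
    subsets = []
    for r in range(k + 1):
        for subset in itertools.combinations(range(k), r):
            idx = list(subset)
            if idx:
                beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
                rss = float(np.sum((y - X[:, idx] @ beta) ** 2))
            else:
                beta, rss = np.array([]), float(np.sum(y ** 2))
            bic = n * np.log(rss / n) + len(idx) * np.log(n)
            subsets.append((idx, beta, bic))

    bics = np.array([bic for _, _, bic in subsets])
    w = np.exp(-(bics - bics.min()) / 2.0)
    w /= w.sum()

    # BMA coefficient estimate: a coefficient excluded from a subset counts
    # as an exact zero, so averaging over the 2^k fits shrinks weak
    # coefficients toward zero.
    averaged_beta = np.zeros(k)
    for (idx, beta, _), wi in zip(subsets, w):
        full = np.zeros(k)
        full[idx] = beta
        averaged_beta += wi * full
    print("BMA-averaged coefficients:", np.round(averaged_beta, 2))

    # A single k-dimensional fit with shrinkage (ridge here, standing in
    # for an informative prior) reaches a similar answer with one model.
    lam = 1.0
    ridge_beta = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
    print("Ridge (shrinkage) coefficients:", np.round(ridge_beta, 2))

Excluded coefficients enter the average as exact zeros, which is the sense in which discrete model averaging acts like a shrinkage prior with a spike at zero.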
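
As a gloss on sentence 17, second-order probability as a hierarchical model can be sketched in a few lines. The discrete family of Bernoulli measures and the weights on them are illustrative assumptions:

    import numpy as np

    # A family of probability measures: Bernoulli distributions with
    # different success probabilities.
    family_p = np.array([0.2, 0.5, 0.8])

    # Second-order probability: weights placed on the members of the family
    # (equivalently, a hierarchical prior on the parameter p).
    weights = np.array([0.5, 0.3, 0.2])

    # The implied first-order probability of success is just the mixture,
    # i.e., marginalizing over the hierarchy:
    # 0.5*0.2 + 0.3*0.5 + 0.2*0.8 = 0.41.
    p_success = float(weights @ family_p)
    print(f"Marginal P(success) = {p_success:.2f}")

    # Updating on one observed success re-weights the family by likelihood:
    # ordinary Bayes on the hierarchical model, no belief functions needed.
    posterior = weights * family_p
    posterior /= posterior.sum()
    print("Posterior weights:", np.round(posterior, 3))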


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('bma', 0.413), ('model', 0.285), ('phi', 0.24), ('hierarchical', 0.207), ('averaging', 0.171), ('theta', 0.168), ('shrinkage', 0.146), ('simplest', 0.141), ('marginals', 0.138), ('larger', 0.131), ('firoozye', 0.13), ('priors', 0.129), ('easily', 0.124), ('frequentist', 0.116), ('probability', 0.109), ('insufficient', 0.106), ('models', 0.103), ('intuitive', 0.099), ('extend', 0.097), ('weighted', 0.093), ('combined', 0.086), ('two', 0.078), ('regressions', 0.076), ('discrete', 0.076), ('functions', 0.076), ('belief', 0.073), ('interest', 0.071), ('prior', 0.07), ('continuous', 0.069), ('bates', 0.069), ('imprecise', 0.069), ('separable', 0.069), ('tractable', 0.069), ('collinearity', 0.069), ('shaffer', 0.069), ('far', 0.066), ('question', 0.065), ('expands', 0.065), ('granger', 0.065), ('hyperprior', 0.065), ('complex', 0.064), ('informative', 0.063), ('lead', 0.063), ('selection', 0.063), ('er', 0.062), ('inability', 0.062), ('running', 0.061), ('dempster', 0.06), ('preferable', 0.06), ('incompatible', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model


2 0.24744451 1941 andrew gelman stats-2013-07-16-Priors

Introduction: Nick Firoozye writes: While I am absolutely sympathetic to the Bayesian agenda, I am often troubled by the requirement of having priors. We must have priors on the parameters of an infinite number of models we have never seen before, and I find this troubling. There is a similarly troubling problem in the economics of utility theory. Utility is on consumables. To be complete, a consumer must assign utility to all sorts of things they never would have encountered. More recent versions of utility theory instead make consumption goods a portfolio of attributes. Cadillacs are x many units of luxury, y of transport, etc. And we can automatically have personal utilities for all these attributes. I don’t ever see parameters. Some models have few and some have hundreds. Instead, I see data. So I don’t know how to have an opinion on parameters themselves. Rather, I think it far more natural to have opinions on the behavior of models. The prior predictive density is a good and sensible notion. Also

3 0.24595267 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

Introduction: In response to this article by Cosma Shalizi and myself on the philosophy of Bayesian statistics, David Hogg writes: I [Hogg] agree–even in physics and astronomy–that the models are not “True” in the God-like sense of being absolute reality (that is, I am not a realist); and I have argued (a philosophically very naive paper, but hey, I was new to all this) that for pretty fundamental reasons we could never arrive at the True (with a capital “T”) model of the Universe. The goal of inference is to find the “best” model, where “best” might have something to do with prediction, or explanation, or message length, or (horror!) our utility. Needless to say, most of my physics friends *are* realists, even in the face of “effective theories” as Newtonian mechanics is an effective theory of GR and GR is an effective theory of “quantum gravity” (this plays to your point, because if you think any theory is possibly an effective theory, how could you ever find Truth?). I also liked the i

4 0.20011364 840 andrew gelman stats-2011-08-05-An example of Bayesian model averaging

Introduction: Jay Ulfelder writes: I see that you blogged about limitations of Bayesian model averaging. As it happens, I was also blogging about BMA, but with an example where it seems to be working reasonably well, at least for the narrow purpose of forecasting. The topic is the analysis I did for CFR earlier this year on nonviolent uprisings. I don’t have time to look into this one but I wanted to pass it on.

5 0.19751696 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors

Introduction: Following up on Christian’s post [link fixed] on the topic, I’d like to offer a few thoughts of my own. In BDA, we express the idea that a noninformative prior is a placeholder: you can use the noninformative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information. Same thing for the data model (the “likelihood”), for that matter: it often makes sense to start with something simple and conventional and then go from there. So, in that sense, noninformative priors are no big deal, they’re just a way to get started. Just don’t take them too seriously. Traditionally in statistics we’ve worked with the paradigm of a single highly informative dataset with only weak external information. But if the data are sparse and prior information is strong, we have to think differently. And, when you increase the dimensionality of a problem, both these things hap

6 0.19624834 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

7 0.19587457 1431 andrew gelman stats-2012-07-27-Overfitting

8 0.19445369 1392 andrew gelman stats-2012-06-26-Occam

9 0.18356185 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

10 0.17940874 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

11 0.17770873 1946 andrew gelman stats-2013-07-19-Prior distributions on derived quantities rather than on parameters themselves

12 0.17494076 1868 andrew gelman stats-2013-05-23-Validation of Software for Bayesian Models Using Posterior Quantiles

13 0.17488614 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

14 0.17314535 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

15 0.17148866 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings

16 0.16864759 811 andrew gelman stats-2011-07-20-Kind of Bayesian

17 0.16776122 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

18 0.16702412 291 andrew gelman stats-2010-09-22-Philosophy of Bayes and non-Bayes: A dialogue with Deborah Mayo

19 0.16303314 2133 andrew gelman stats-2013-12-13-Flexibility is good

20 0.16185148 1092 andrew gelman stats-2011-12-29-More by Berger and me on weakly informative priors


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.26), (1, 0.303), (2, 0.037), (3, 0.079), (4, -0.016), (5, -0.02), (6, 0.076), (7, -0.01), (8, 0.031), (9, 0.062), (10, 0.033), (11, 0.06), (12, -0.042), (13, -0.016), (14, -0.077), (15, -0.015), (16, 0.042), (17, -0.008), (18, 0.008), (19, -0.008), (20, 0.004), (21, -0.008), (22, -0.027), (23, -0.063), (24, -0.04), (25, 0.005), (26, 0.003), (27, -0.03), (28, -0.003), (29, -0.006), (30, -0.041), (31, -0.014), (32, -0.032), (33, 0.012), (34, -0.029), (35, 0.015), (36, 0.002), (37, -0.017), (38, -0.021), (39, 0.023), (40, -0.001), (41, 0.028), (42, 0.018), (43, 0.01), (44, -0.025), (45, -0.029), (46, 0.029), (47, 0.013), (48, -0.021), (49, 0.039)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97566205 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model


2 0.90601987 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging


3 0.89681077 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models) before coming up with a “final” model. When applying Bayesian modeling to real-world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

4 0.89593726 1216 andrew gelman stats-2012-03-17-Modeling group-level predictors in a multilevel regression

Introduction: Trey Causey writes: Do you have suggestions as to model selection strategies akin to Bayesian model averaging for multilevel models when level-2 inputs are of substantive interest? I [Causey] have seen plenty of R packages and procedures for non-multilevel models, and tried the glmulti package but found that it did not perform well with more than a few level-2 variables. My quick answer is: with a name like that, you should really be fitting three-level models! My longer answer is: regular readers will be unsurprised to hear that I’m no fan of Bayesian model averaging. Instead I’d prefer to bite the bullet and assign an informative prior distribution on these coefficients. I don’t have a great example of such an analysis but I’m more and more thinking that this is the way to go. I don’t see the point in aiming for the intermediate goal of pruning the predictors; I’d rather have a procedure that includes prior information on the predictors and their interactions.

5 0.89323896 1723 andrew gelman stats-2013-02-15-Wacky priors can work well?

Introduction: Dave Judkins writes: I would love to see a blog entry on this article, Bayesian Model Selection in High-Dimensional Settings, by Valen Johnson and David Rossell. The simulation results are very encouraging although the choice of colors for some of the graphics is unfortunate. Unless I am colorblind in some way that I am unaware of, they have two thin charcoal lines that are indistinguishable. When Dave Judkins puts in a request, I’ll respond. Also, I’m always happy to see a new Val Johnson paper. Val and I are contemporaries—he and I got our PhD’s at around the same time, with both of us working on Bayesian image reconstruction, then in the early 1990s Val was part of the legendary group at Duke’s Institute of Statistics and Decision Sciences—a veritable ’27 Yankees featuring Mike West, Merlise Clyde, Michael Lavine, Dave Higdon, Peter Mueller, Val, and a bunch of others. I always thought it was too bad they all had to go their separate ways. Val also wrote two classic p

6 0.89242691 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

7 0.89188725 1459 andrew gelman stats-2012-08-15-How I think about mixture models

8 0.87608171 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

9 0.86814994 1510 andrew gelman stats-2012-09-25-Incoherence of Bayesian data analysis

10 0.86701721 1817 andrew gelman stats-2013-04-21-More on Bayesian model selection in high-dimensional settings

11 0.86430138 1392 andrew gelman stats-2012-06-26-Occam

12 0.86421967 1041 andrew gelman stats-2011-12-04-David MacKay and Occam’s Razor

13 0.85860217 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

14 0.85794479 1221 andrew gelman stats-2012-03-19-Whassup with deviance having a high posterior correlation with a parameter in the model?

15 0.85150015 1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?

16 0.8481698 1284 andrew gelman stats-2012-04-26-Modeling probability data

17 0.84617984 398 andrew gelman stats-2010-11-06-Quote of the day

18 0.84537667 20 andrew gelman stats-2010-05-07-Bayesian hierarchical model for the prediction of soccer results

19 0.84168613 2342 andrew gelman stats-2014-05-21-Models with constraints

20 0.83672047 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(16, 0.042), (21, 0.013), (22, 0.011), (24, 0.473), (27, 0.014), (86, 0.015), (89, 0.015), (99, 0.246)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.99658531 643 andrew gelman stats-2011-04-02-So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing

Introduction: Steve Ziliak points me to this article by the always-excellent Carl Bialik, slamming hypothesis tests. I only wish Carl had talked with me before so hastily posting, though! I would’ve argued with some of the things in the article. In particular, he writes: Reese and Brad Carlin . . . suggest that Bayesian statistics are a better alternative, because they tackle the probability that the hypothesis is true head-on, and incorporate prior knowledge about the variables involved. Brad Carlin does great work in theory, methods, and applications, and I like the bit about the prior knowledge (although I might prefer the more general phrase “additional information”), but I hate that quote! My quick response is that the hypothesis of zero effect is almost never true! The problem with the significance testing framework–Bayesian or otherwise–is in the obsession with the possibility of an exact zero effect. The real concern is not with zero, it’s with claiming a positive effect whe

2 0.99611235 38 andrew gelman stats-2010-05-18-Breastfeeding, infant hyperbilirubinemia, statistical graphics, and modern medicine

Introduction: Dan Lakeland asks: When are statistical graphics potentially life threatening? When they’re poorly designed, and used to make decisions on potentially life-threatening topics, like medical decision making, engineering design, and the like. The American Academy of Pediatrics has dropped the ball on communicating to physicians about infant jaundice. Another message in this post is that bad decisions can compound each other. It’s an interesting story (follow the link above for the details) and would be great for a class in decision analysis or statistical communication. I have no idea how to get from A to B here, in the sense of persuading hospitals to do this sort of thing better. I’d guess the first step is to carefully lay out costs and benefits. When doctors and nurses take extra precautions for safety, it could be useful to lay out the ultimate goals and estimate the potential costs and benefits of different approaches.

3 0.99532336 938 andrew gelman stats-2011-10-03-Comparing prediction errors

Introduction: Someone named James writes: I’m working on a classification task, sentence segmentation. The classifier algorithm we use (BoosTexter, a boosted learning algorithm) classifies each word independently conditional on its features, i.e. a bag-of-words model, so any contextual clues need to be encoded into the features. The feature extraction system I am proposing in my thesis uses a heteroscedastic LDA to transform data to produce the features the classifier runs on. The HLDA system has a couple parameters I’m testing, and I’m running a 3×2 full factorial experiment. That’s the background which may or may not be relevant to the question. The output of each trial is a class (there are only 2 classes, right now) for every word in the dataset. Because of the nature of the task, one class strongly predominates, say 90-95% of the data. My question is this: in terms of overall performance (we use F1 score), many of these trials are pretty close together, which leads me to ask whethe

4 0.99458539 1092 andrew gelman stats-2011-12-29-More by Berger and me on weakly informative priors

Introduction: A couple days ago we discussed some remarks by Tony O’Hagan and Jim Berger on weakly informative priors. Jim followed up on Deborah Mayo’s blog with this: Objective Bayesian priors are often improper (i.e., have infinite total mass), but this is not a problem when they are developed correctly. But not every improper prior is satisfactory. For instance, the constant prior is known to be unsatisfactory in many situations. The ‘solution’ pseudo-Bayesians often use is to choose a constant prior over a large but bounded set (a ‘weakly informative’ prior), saying it is now proper and so all is well. This is not true; if the constant prior on the whole parameter space is bad, so will be the constant prior over the bounded set. The problem is, in part, that some people confuse proper priors with subjective priors and, having learned that true subjective priors are fine, incorrectly presume that weakly informative proper priors are fine. I have a few reactions to this: 1. I agree

5 0.99302202 241 andrew gelman stats-2010-08-29-Ethics and statistics in development research

Introduction: From Banerjee and Duflo, “The Experimental Approach to Development Economics,” Annual Review of Economics (2009): One issue with the explicit acknowledgment of randomization as a fair way to allocate the program is that implementers may find that the easiest way to present it to the community is to say that an expansion of the program is planned for the control areas in the future (especially when such is indeed the case, as in phased-in design). I can’t quite figure out whether Banerjee and Duflo are saying that they would lie and tell people that an expansion is planned when it isn’t, or whether they’re deploring that other people do it. I’m not bothered by a lot of the deception in experimental research–for example, I think the Milgram obedience experiment was just fine–but somehow the above deception bothers me. It just seems wrong to tell people that an expansion is planned if it’s not. P.S. Overall the article is pretty good. My only real problem with it is that

6 0.99281555 1978 andrew gelman stats-2013-08-12-Fixing the race, ethnicity, and national origin questions on the U.S. Census

7 0.99232793 1479 andrew gelman stats-2012-09-01-Mothers and Moms

8 0.99081933 482 andrew gelman stats-2010-12-23-Capitalism as a form of voluntarism

9 0.98895687 545 andrew gelman stats-2011-01-30-New innovations in spam

10 0.98880106 240 andrew gelman stats-2010-08-29-ARM solutions

11 0.98635912 1706 andrew gelman stats-2013-02-04-Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

12 0.98623478 1787 andrew gelman stats-2013-04-04-Wanna be the next Tyler Cowen? It’s not as easy as you might think!

13 0.98569643 743 andrew gelman stats-2011-06-03-An argument that can’t possibly make sense

14 0.98412001 1891 andrew gelman stats-2013-06-09-“Heterogeneity of variance in experimental studies: A challenge to conventional interpretations”

15 0.98337561 2229 andrew gelman stats-2014-02-28-God-leaf-tree

16 0.97746038 1046 andrew gelman stats-2011-12-07-Neutral noninformative and informative conjugate beta and gamma prior distributions

17 0.97669756 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies

18 0.9712652 278 andrew gelman stats-2010-09-15-Advice that might make sense for individuals but is negative-sum overall

19 0.97043496 1437 andrew gelman stats-2012-07-31-Paying survey respondents

20 0.96765006 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample