andrew_gelman_stats andrew_gelman_stats-2012 andrew_gelman_stats-2012-1287 knowledge-graph by maker-knowledge-mining

1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?


meta info for this blog

Source: html

Introduction: David Hogg writes: My (now deceased) collaborator and guru in all things inference, Sam Roweis, used to emphasize to me that we should evaluate models in the data space — not the parameter space — because models are always effectively “effective” and not really, fundamentally true. Or, in other words, models should be compared in the space of their predictions, not in the space of their parameters (the parameters didn’t really “exist” at all for Sam). In that spirit, when we estimate the effectiveness of an MCMC method or tuning — by autocorrelation time or ESJD or anything else — shouldn’t we be looking at the changes in the model predictions over time, rather than the changes in the parameters over time? That is, the autocorrelation time should be the autocorrelation time in what the model (at the walker position) predicts for the data, and the ESJD should be the expected squared jump distance in what the model predicts for the data? This might resolve the concern I expressed a few months ago to you that the ESJD is not affine-invariant, and etc.
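To pin down the quantity Hogg is proposing (this formalization is my addition, not from the original exchange): writing theta_t for the chain's state at iteration t and yhat(theta) for the model's prediction of the data at that state, the usual ESJD and the prediction-space variant he describes are

\mathrm{ESJD}_{\theta} = \mathbb{E}\left[\,\lVert \theta_{t+1} - \theta_t \rVert^2\,\right], \qquad \mathrm{ESJD}_{\hat{y}} = \mathbb{E}\left[\,\lVert \hat{y}(\theta_{t+1}) - \hat{y}(\theta_t) \rVert^2\,\right].

The second form is unchanged under any reparameterization of theta (affine or otherwise), since a bijective change of variables leaves the predictions themselves alone, which is presumably why Hogg suggests it might resolve the affine-invariance concern.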


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 David Hogg writes: My (now deceased) collaborator and guru in all things inference, Sam Roweis, used to emphasize to me that we should evaluate models in the data space — not the parameter space — because models are always effectively “effective” and not really, fundamentally true. [sent-1, score-1.167]

2 Or, in other words, models should be compared in the space of their predictions, not in the space of their parameters (the parameters didn’t really “exist” at all for Sam). [sent-2, score-1.239]

3 In that spirit, when we estimate the effectiveness of an MCMC method or tuning — by autocorrelation time or ESJD or anything else — shouldn’t we be looking at the changes in the model predictions over time, rather than the changes in the parameters over time? [sent-3, score-1.556]

4 That is, the autocorrelation time should be the autocorrelation time in what the model (at the walker position) predicts for the data, and the ESJD should be the expected squared jump distance in what the model predicts for the data? [sent-4, score-1.693]

5 This might resolve the concern I expressed a few months ago to you that the ESJD is not affine-invariant, and etc. [sent-5, score-0.064]

6 Hogg continues with an example: Imagine you have a three-planet model for some radial velocity data. [sent-7, score-0.204]

7 Sometimes we have a redundant parameterization in which the individual parameters are not identified, but predictions are well-identified. [sent-10, score-0.897]

8 For a simple example, suppose you have a model, y ~ N(a+b, 1), with a uniform prior distribution on (a,b). [sent-11, score-0.1]

9 Then your data don’t tell you anything about a or b, but you can get good inference for a+b and good predictions for new data from the same model. [sent-12, score-0.559]

10 On the other hand, if you want to make a prediction for new data z ~ N(a,1), you’re out of luck. [sent-13, score-0.2]

11 More generally, one problem I have with the hard-line predictivist stance—the idea that models and parameters are mere fictions whereas predictions are real—is that models and parameters can be thought of as bridges between the data of yesterday and the data of tomorrow. [sent-14, score-1.571]

12 It’s not just part of a prediction for some particular measurement. [sent-16, score-0.094]

13 For a more humble example, consider our discussion of physiologically-based pharmacokinetics models in Section 4. [sent-18, score-0.388]

14 In a Bayesian model, good parameterization can be important, as it is typically through the parameters that we put in prior information. [sent-20, score-0.631]

15 In many ways, the parameterization represents a key source of prior information. [sent-21, score-0.316]
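To make sentences 8 through 10 concrete, here is a minimal sketch (my addition, not from the post) of the y ~ N(a+b, 1) example, in Python with numpy. To keep the demo's posterior proper I assume a very weak N(0, 10^2) prior on a and b rather than the post's flat prior; the data, step size, and chain length are likewise made up. The sketch also computes ESJD both ways, echoing Hogg's question: in parameter space the jumps get credit for movement along the unidentified a - b direction, which never affects the predictions, while in prediction space (here the prediction is just a + b) the chain is judged only on what matters for the data.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=20)             # made-up data; true a + b = 1.5

def log_post(a, b):
    log_prior = -0.5 * (a**2 + b**2) / 100.0  # weak N(0, 10^2) priors (my assumption)
    return log_prior - 0.5 * np.sum((y - (a + b))**2)

a, b, draws = 0.0, 0.0, []
for _ in range(20000):                        # random-walk Metropolis
    a_prop, b_prop = a + rng.normal(0, 0.5), b + rng.normal(0, 0.5)
    if np.log(rng.uniform()) < log_post(a_prop, b_prop) - log_post(a, b):
        a, b = a_prop, b_prop
    draws.append((a, b))
draws = np.array(draws)[5000:]                # drop burn-in

print("posterior sd of a:    ", draws[:, 0].std())        # large: a is pinned only by the weak prior
print("posterior sd of a + b:", draws.sum(axis=1).std())  # ~ 1/sqrt(20): identified

jumps = np.diff(draws, axis=0)                # ESJD computed two ways, as in Hogg's question
print("ESJD, parameter space: ", np.mean(np.sum(jumps**2, axis=1)))
print("ESJD, prediction space:", np.mean(np.diff(draws.sum(axis=1))**2))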


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('autocorrelation', 0.414), ('parameters', 0.315), ('predictions', 0.285), ('esjd', 0.277), ('space', 0.231), ('parameterization', 0.216), ('planets', 0.185), ('models', 0.147), ('hogg', 0.14), ('sam', 0.131), ('predicts', 0.128), ('model', 0.116), ('time', 0.112), ('data', 0.106), ('prior', 0.1), ('deceased', 0.098), ('pharmacokinetics', 0.098), ('swapping', 0.098), ('prediction', 0.094), ('guru', 0.088), ('velocity', 0.088), ('changes', 0.087), ('bridges', 0.085), ('separated', 0.085), ('parameter', 0.085), ('humble', 0.083), ('realistically', 0.083), ('permutation', 0.081), ('walker', 0.081), ('redundant', 0.081), ('bois', 0.079), ('tuning', 0.077), ('infinite', 0.073), ('squared', 0.072), ('stance', 0.071), ('collaborator', 0.07), ('modes', 0.068), ('mere', 0.065), ('universal', 0.064), ('resolve', 0.064), ('effectiveness', 0.063), ('inference', 0.062), ('spirit', 0.062), ('fundamentally', 0.062), ('switch', 0.062), ('consider', 0.06), ('speed', 0.06), ('finite', 0.06), ('mcmc', 0.059), ('implementation', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?


2 0.2941128 650 andrew gelman stats-2011-04-05-Monitor the efficiency of your Markov chain sampler using expected squared jumped distance!

Introduction: Marc Tanguay writes in with a specific question that has a very general answer. First, the question: I [Tanguay] am currently running an MCMC for which I have 3 parameters that are restricted to a specific space. 2 are bounded between 0 and 1 while the third is binary and updated by a Beta-Binomial. Since my priors are also bounded, I notice that, conditional on all the rest (which covers both data and other parameters), the density was not varying a lot within the space of the parameters. As a result, the acceptance rate is high, about 85%, and this despite the fact that all of the parameter space is explored. Since in your book the optimal acceptance rates prescribed are lower than 50% (in the case of multiple parameters), do you think I should worry about getting 85%? Or is this normal given the restrictions on the parameters? First off: Yes, my guess is that you should be taking bigger jumps. 85% seems like too high an acceptance rate for Metropolis jumping. More generally, t
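As a side note (mine, not part of either post): the acceptance-rate versus step-size tradeoff quoted here is easy to reproduce on a toy target. A sketch in Python, assuming a standard normal target and made-up step sizes:

import numpy as np

def metropolis_acceptance(step, n=50000, seed=1):
    # Random-walk Metropolis on a N(0,1) target; returns the acceptance rate.
    rng = np.random.default_rng(seed)
    x, accepted = 0.0, 0
    for _ in range(n):
        x_prop = x + rng.normal(0, step)
        # log-density ratio for the standard normal target
        if np.log(rng.uniform()) < 0.5 * (x**2 - x_prop**2):
            x, accepted = x_prop, accepted + 1
    return accepted / n

# Tiny steps accept nearly everything but barely move; the often-cited
# optimum for this 1-D target is near step 2.4, acceptance around 44%.
for step in (0.1, 1.0, 2.4, 10.0):
    print(f"step {step:4.1f}: acceptance {metropolis_acceptance(step):.2f}")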

3 0.1727702 160 andrew gelman stats-2010-07-23-Unhappy with improvement by a factor of 10^29

Introduction: I have an optimization problem: I have a complicated physical model that predicts energy and thermal behavior of a building, given the values of a slew of parameters, such as insulation effectiveness, window transmissivity, etc. I’m trying to find the parameter set that best fits several weeks of thermal and energy use data from the real building that we modeled. (Of course I would rather explore parameter space and come up with probability distributions for the parameters, and maybe that will come later, but for now I’m just optimizing). To do the optimization, colleagues and I implemented a “particle swarm optimization” algorithm on a massively parallel machine. This involves giving each of about 120 “particles” an initial position in parameter space, then letting them move around, trying to move to better positions according to a specific algorithm. We gave each particle an initial position sampled from our prior distribution for each parameter. So far we’ve run about 140 itera
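For readers unfamiliar with the method being described, here is a minimal particle swarm sketch (my construction; the post's actual implementation and objective are not shown). rosenbrock() is a hypothetical stand-in for the building-energy model, and the inertia and pull coefficients are textbook defaults, not values from the post:

import numpy as np

def rosenbrock(x):                      # hypothetical stand-in objective
    return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)

rng = np.random.default_rng(3)
dim, n_particles = 5, 120               # 120 particles, as in the post
pos = rng.uniform(-2, 2, (n_particles, dim))   # "sampled from the prior"
vel = np.zeros_like(pos)
pbest = pos.copy()                      # each particle's best position so far
pbest_val = np.array([rosenbrock(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(140):                    # ~140 iterations, as in the post
    r1, r2 = rng.uniform(size=(2, n_particles, 1))
    # inertia plus pulls toward the personal best and the swarm best
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    vals = np.array([rosenbrock(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best value found:", pbest_val.min())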

4 0.16049966 858 andrew gelman stats-2011-08-17-Jumping off the edge of the world

Introduction: Tomas Iesmantas writes: I’m facing a problem where the parameter space is bounded, e.g. all parameters have to be positive. If I use a normal distribution as the proposal distribution in MCMC, then at some iterations I get negative proposals. So my question is: should I recalculate the acceptance probability every time I reject a proposal (something like the delayed rejection method), or should I use another proposal (lognormal, truncated normal, etc.)? The simplest solution is to just calculate p(theta)=0 for theta outside the legal region, thus rejecting those jumps. This will work fine (just remember that when you reject, you have to stay at the last value for one more iteration), but if you’re doing these rejections all the time, you might want to reparameterize your space, for example using logs for positive parameters, logits for constrained parameters, and softmax for parameters that are constrained to sum to 1.
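A small sketch (my addition) of the reparameterization route suggested here, sampling a positive scale parameter on the log scale; the model, data, and step size are made up, and the flat prior on sigma is an assumption for the demo:

import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0, 2.0, size=50)         # made-up data, y ~ N(0, sigma), true sigma = 2

def log_post_phi(phi):
    # Unconstrained phi = log(sigma); no proposals ever land outside the support.
    sigma = np.exp(phi)
    loglik = -len(y) * np.log(sigma) - 0.5 * np.sum(y**2) / sigma**2
    return loglik + phi                 # + phi is the Jacobian of sigma = exp(phi)

phi, draws = 0.0, []
for _ in range(10000):                  # random-walk Metropolis on phi
    phi_prop = phi + rng.normal(0, 0.3)
    if np.log(rng.uniform()) < log_post_phi(phi_prop) - log_post_phi(phi):
        phi = phi_prop
    draws.append(np.exp(phi))
print("posterior mean of sigma:", np.mean(draws[2000:]))   # should be near 2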

5 0.15622824 1392 andrew gelman stats-2012-06-26-Occam

Introduction: Cosma Shalizi and Larry Wasserman discuss some papers from a conference on Ockham’s Razor. I don’t have anything new to add on this so let me link to past blog entries on the topic and repost the following from 2004 : A lot has been written in statistics about “parsimony”—that is, the desire to explain phenomena using fewer parameters–but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.) Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better. In practice, I often use simple models—because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts! My favorite quote on this comes from Rad

6 0.15067902 1474 andrew gelman stats-2012-08-29-More on scaled-inverse Wishart and prior independence

7 0.14932591 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

8 0.14505801 1466 andrew gelman stats-2012-08-22-The scaled inverse Wishart prior distribution for a covariance matrix in a hierarchical model

9 0.14388536 1946 andrew gelman stats-2013-07-19-Prior distributions on derived quantities rather than on parameters themselves

10 0.14262018 1941 andrew gelman stats-2013-07-16-Priors

11 0.1382319 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

12 0.1381318 1144 andrew gelman stats-2012-01-29-How many parameters are in a multilevel model?

13 0.13770947 2129 andrew gelman stats-2013-12-10-Cross-validation and Bayesian estimation of tuning parameters

14 0.13695857 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

15 0.13501868 779 andrew gelman stats-2011-06-25-Avoiding boundary estimates using a prior distribution as regularization

16 0.13019656 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

17 0.12570213 1406 andrew gelman stats-2012-07-05-Xiao-Li Meng and Xianchao Xie rethink asymptotics

18 0.12509732 931 andrew gelman stats-2011-09-29-Hamiltonian Monte Carlo stories

19 0.12101801 1208 andrew gelman stats-2012-03-11-Gelman on Hennig on Gelman on Bayes

20 0.11739869 547 andrew gelman stats-2011-01-31-Using sample size in the prior distribution


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.178), (1, 0.182), (2, 0.014), (3, 0.089), (4, 0.003), (5, -0.006), (6, 0.062), (7, -0.027), (8, -0.029), (9, 0.053), (10, 0.031), (11, 0.026), (12, -0.026), (13, -0.022), (14, -0.094), (15, -0.016), (16, 0.009), (17, -0.011), (18, 0.025), (19, -0.013), (20, -0.006), (21, -0.055), (22, -0.031), (23, -0.019), (24, 0.015), (25, 0.016), (26, -0.037), (27, 0.003), (28, 0.057), (29, 0.01), (30, -0.033), (31, -0.044), (32, -0.019), (33, -0.01), (34, -0.022), (35, -0.005), (36, -0.025), (37, -0.045), (38, 0.018), (39, 0.001), (40, -0.01), (41, 0.043), (42, -0.066), (43, 0.024), (44, -0.014), (45, -0.031), (46, -0.0), (47, 0.024), (48, 0.062), (49, -0.02)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95049101 1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?


2 0.83257246 1459 andrew gelman stats-2012-08-15-How I think about mixture models

Introduction: Larry Wasserman refers to finite mixture models as “beasts” and jokes that they “should be avoided at all costs.” I’ve thought a lot about mixture models, ever since using them in an analysis of voting patterns that was published in 1990. First off, I’d like to say that our model was useful, so I’d prefer not to pay the cost of avoiding it. For a quick description of our mixture model and its context, see pp. 379-380 of my article in the Jim Berger volume). Actually, our case was particularly difficult because we were not even fitting a mixture model to data, we were fitting it to latent data and using the model to perform partial pooling. My difficulties in trying to fit this model inspired our discussion of mixture models in Bayesian Data Analysis (page 109 in the second edition, in the section on “Counterexamples to the theorems”). I agree with Larry that if you’re fitting a mixture model, it’s good to be aware of the problems that arise if you try to estimate

3 0.81864262 398 andrew gelman stats-2010-11-06-Quote of the day

Introduction: “A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model.”

4 0.81012803 1374 andrew gelman stats-2012-06-11-Convergence Monitoring for Non-Identifiable and Non-Parametric Models

Introduction: Becky Passonneau and colleagues at the Center for Computational Learning Systems (CCLS) at Columbia have been working on a project for ConEd (New York’s major electric utility) to rank structures based on vulnerability to secondary events (e.g., transformer explosions, cable meltdowns, electrical fires). They’ve been using the R implementation BayesTree of Chipman, George and McCulloch’s Bayesian Additive Regression Trees (BART). BART is a Bayesian non-parametric method that is non-identifiable in two ways. Firstly, it is an additive tree model with a fixed number of trees, the indexes of which aren’t identified (you get the same predictions in a model swapping the order of the trees). This is the same kind of non-identifiability you get with any mixture model (additive or interpolated) with an exchangeable prior on the mixture components. Secondly, the trees themselves have varying structure over samples in terms of number of nodes and their topology (depth, branching, etc
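The first kind of non-identifiability described here, invariance under relabeling of exchangeable components, can be checked directly; a toy demo (mine, using a two-component normal mixture rather than BART's trees):

import numpy as np

y = np.array([-1.2, 0.3, 2.5, 3.1])    # made-up data

def npdf(x, m):                          # N(m, 1) density
    return np.exp(-0.5 * (x - m)**2) / np.sqrt(2 * np.pi)

def mix_loglik(w, mu1, mu2):             # two-component normal mixture log-likelihood
    return np.sum(np.log(w * npdf(y, mu1) + (1 - w) * npdf(y, mu2)))

print(mix_loglik(0.3, 0.0, 3.0))         # one labeling
print(mix_loglik(0.7, 3.0, 0.0))         # labels swapped: identical value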

5 0.80301458 160 andrew gelman stats-2010-07-23-Unhappy with improvement by a factor of 10^29

Introduction: I have an optimization problem: I have a complicated physical model that predicts energy and thermal behavior of a building, given the values of a slew of parameters, such as insulation effectiveness, window transmissivity, etc. I’m trying to find the parameter set that best fits several weeks of thermal and energy use data from the real building that we modeled. (Of course I would rather explore parameter space and come up with probability distributions for the parameters, and maybe that will come later, but for now I’m just optimizing). To do the optimization, colleagues and I implemented a “particle swarm optimization” algorithm on a massively parallel machine. This involves giving each of about 120 “particles” an initial position in parameter space, then letting them move around, trying to move to better positions according to a specific algorithm. We gave each particle an initial position sampled from our prior distribution for each parameter. So far we’ve run about 140 itera

6 0.80015928 1460 andrew gelman stats-2012-08-16-“Real data can be a pain”

7 0.79796523 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

8 0.78099132 1363 andrew gelman stats-2012-06-03-Question about predictive checks

9 0.77453578 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

10 0.77151716 754 andrew gelman stats-2011-06-09-Difficulties with Bayesian model averaging

11 0.76991963 1431 andrew gelman stats-2012-07-27-Overfitting

12 0.76664865 1465 andrew gelman stats-2012-08-21-D. Buggin

13 0.76567036 669 andrew gelman stats-2011-04-19-The mysterious Gamma (1.4, 0.4)

14 0.76181382 2299 andrew gelman stats-2014-04-21-Stan Model of the Week: Hierarchical Modeling of Supernovas

15 0.75776988 773 andrew gelman stats-2011-06-18-Should we always be using the t and robit instead of the normal and logit?

16 0.75698388 1723 andrew gelman stats-2013-02-15-Wacky priors can work well?

17 0.75583971 184 andrew gelman stats-2010-08-04-That half-Cauchy prior

18 0.75420082 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)

19 0.75415343 20 andrew gelman stats-2010-05-07-Bayesian hierarchical model for the prediction of soccer results

20 0.75245064 1401 andrew gelman stats-2012-06-30-David Hogg on statistics


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(6, 0.078), (12, 0.016), (13, 0.017), (16, 0.045), (20, 0.127), (21, 0.059), (24, 0.194), (42, 0.016), (56, 0.016), (64, 0.011), (86, 0.019), (95, 0.024), (97, 0.015), (99, 0.245)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9426356 1287 andrew gelman stats-2012-04-28-Understanding simulations in terms of predictive inference?


2 0.9354322 479 andrew gelman stats-2010-12-20-WWJD? U can find out!

Introduction: Two positions open in the statistics group at the NYU education school. If you get the job, you get to work with Jennifer Hill! One position is a postdoctoral fellowship, and the other is a visiting professorship. The latter position requires “the demonstrated ability to develop a nationally recognized research program,” which seems like a lot to ask of a visiting professor. Do they expect the visiting prof to develop a nationally recognized research program and then leave it there at NYU after the visit is over? In any case, Jennifer and her colleagues are doing excellent work, both applied and methodological, and this seems like a great opportunity.

3 0.92082185 1913 andrew gelman stats-2013-06-24-Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

Introduction: I’m reposting this classic from 2011 . . . Peter Bergman pointed me to this discussion from Cyrus of a presentation by Guido Imbens on design of randomized experiments. Cyrus writes: The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis–what Imbens referred to as “testing”–along with a (2) Neyman-type point estimate of the sample average treatment effect and confidence interval–what Imbens referred to as “estimation.” . . . Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I [Cyrus] took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them. I agree completely. T

4 0.91880953 870 andrew gelman stats-2011-08-25-Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

Introduction: Peter Bergman points me to this discussion from Cyrus of a presentation by Guido Imbens on design of randomized experiments. Cyrus writes: The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis–what Imbens referred to as “testing”–along with a (2) Neyman-type point estimate of the sample average treatment effect and confidence interval–what Imbens referred to as “estimation.” . . . Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I [Cyrus] took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them. I agree completely. This is something I’ve been saying for a long
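A compact sketch (mine, not from either post) of the Fisher-type permutation test being contrasted here, under the sharp null that treatment has no effect on any unit; the outcome numbers are invented:

import numpy as np

rng = np.random.default_rng(4)
treated  = np.array([5.1, 6.0, 7.2, 5.8])   # hypothetical outcomes
control  = np.array([4.0, 5.2, 4.8, 4.4])
observed = treated.mean() - control.mean()

# Under the sharp null all potential outcomes are fixed and only the
# treatment labels are random, so re-randomize the labels and recompute.
pooled, n_t = np.concatenate([treated, control]), len(treated)
perm_stats = []
for _ in range(10000):
    perm = rng.permutation(pooled)
    perm_stats.append(perm[:n_t].mean() - perm[n_t:].mean())
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print("two-sided permutation p-value:", p_value)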

5 0.90942764 1206 andrew gelman stats-2012-03-10-95% intervals that I don’t believe, because they’re from a flat prior I don’t believe

Introduction: Arnaud Trolle (no relation) writes: I have a question about the interpretation of (non-)overlapping of 95% credibility intervals. In a Bayesian ANOVA (a within-subjects one), I computed 95% credibility intervals for the main effects of a factor. I’d like to compare the main effects two by two across the different conditions of the factor. Can I directly interpret the (non-)overlapping of these credibility intervals and make the following statements: “As the 95% credibility intervals do not overlap, both conditions have significantly different main effects” or conversely “As the 95% credibility intervals overlap, the main effects of both conditions are not significantly different, i.e. equivalent”? I heard that, in the case of classical confidence intervals, the second statement is false, but what happens when working within a Bayesian framework? My reply: I think it makes more sense to directly look at inference for the difference. Also, your statements about equivalence
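Gelman's reply, to look directly at inference for the difference, is straightforward given simulation draws; a sketch under the assumption that you already have posterior draws for the two conditions (the draws below are synthetic stand-ins):

import numpy as np

rng = np.random.default_rng(5)
draws_a = rng.normal(1.0, 0.4, 4000)    # stand-ins for posterior draws of
draws_b = rng.normal(1.6, 0.4, 4000)    # the two conditions' main effects

# Form the posterior of the difference rather than eyeballing interval overlap.
diff = draws_b - draws_a
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"95% interval for the difference: ({lo:.2f}, {hi:.2f})")
print("Pr(condition B > condition A):", np.mean(diff > 0))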

6 0.90446442 900 andrew gelman stats-2011-09-11-Symptomatic innumeracy

7 0.90349936 1270 andrew gelman stats-2012-04-19-Demystifying Blup

8 0.90168607 1881 andrew gelman stats-2013-06-03-Boot

9 0.89993083 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

10 0.89845276 2029 andrew gelman stats-2013-09-18-Understanding posterior p-values

11 0.89560378 1409 andrew gelman stats-2012-07-08-Is linear regression unethical in that it gives more weight to cases that are far from the average?

12 0.89541113 1792 andrew gelman stats-2013-04-07-X on JLP

13 0.89538872 2316 andrew gelman stats-2014-05-03-“The graph clearly shows that mammography adds virtually nothing to survival and if anything, decreases survival (and increases cost and provides unnecessary treatment)”

14 0.89406723 502 andrew gelman stats-2011-01-04-Cash in, cash out graph

15 0.89295691 1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals

16 0.89278847 2201 andrew gelman stats-2014-02-06-Bootstrap averaging: Examples where it works and where it doesn’t work

17 0.89208686 896 andrew gelman stats-2011-09-09-My homework success

18 0.89103246 1757 andrew gelman stats-2013-03-11-My problem with the Lindley paradox

19 0.89015555 1080 andrew gelman stats-2011-12-24-Latest in blog advertising

20 0.89011419 1465 andrew gelman stats-2012-08-21-D. Buggin