andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-852 knowledge-graph by maker-knowledge-mining

852 andrew gelman stats-2011-08-13-Checking your model using fake data


meta info for this blog

Source: html

Introduction: Someone sent me the following email: I tried to do a logistic regression . . . I programmed the model in different ways and got different answers . . . can’t get the results to match . . . What am I doing wrong? . . . Here’s my code . . . I didn’t have the time to look at his code so I gave the following general response: One way to check things is to try simulating data from the fitted model, then fit your model again to the simulated data and see what happens. P.S. He followed my suggestion and responded a few days later: Yeah, that did the trick! I was treating a factor variable as a covariate! I love it when generic advice works out!
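The generic advice above can be made concrete. Below is a minimal sketch of a fake-data check for logistic regression: simulate data from known coefficients, fit the model, and confirm the estimates recover the truth. The coefficient values, sample size, and fitting routine are my own illustration, not from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_beta = np.array([0.5, -1.2, 2.0])  # intercept + 2 slopes (assumed values)

# Simulate fake data from the assumed model
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, p)

# Fit logistic regression by Newton's method (iteratively reweighted least squares)
beta = np.zeros(3)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    W = mu * (1.0 - mu)                      # working weights
    grad = X.T @ (y - mu)
    hess = X.T @ (X * W[:, None])
    beta += np.linalg.solve(hess, grad)

print(np.round(beta, 2))  # should land close to true_beta
```

If the fitted coefficients come back far from `true_beta`, something is wrong in the model code; in the email exchange above, exactly this kind of check exposed a factor variable being treated as a numeric covariate.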


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Someone sent me the following email: I tried to do a logistic regression . [sent-1, score-0.666]

2 I programmed the model in different ways and got different answers . [sent-4, score-1.025]

3 I didn’t have the time to look at his code so I gave the following general response: One way to check things is to try simulating data from the fitted model, then fit your model again to the simulated data and see what happens. [sent-17, score-2.186]

4 He followed my suggestion and responded a few days later: Yeah, that did the trick! [sent-20, score-0.598]

5 I was treating a factor variable as a covariate! [sent-21, score-0.499]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('code', 0.267), ('programmed', 0.241), ('simulating', 0.233), ('covariate', 0.233), ('treating', 0.229), ('simulated', 0.205), ('trick', 0.185), ('generic', 0.183), ('model', 0.181), ('match', 0.173), ('fitted', 0.172), ('suggestion', 0.17), ('yeah', 0.167), ('answers', 0.162), ('following', 0.161), ('responded', 0.159), ('logistic', 0.157), ('factor', 0.145), ('followed', 0.136), ('days', 0.133), ('tried', 0.131), ('advice', 0.131), ('gave', 0.127), ('email', 0.126), ('love', 0.125), ('variable', 0.125), ('works', 0.123), ('check', 0.122), ('different', 0.121), ('sent', 0.119), ('later', 0.118), ('ways', 0.109), ('response', 0.103), ('fit', 0.099), ('regression', 0.098), ('wrong', 0.095), ('try', 0.09), ('got', 0.09), ('someone', 0.087), ('results', 0.085), ('data', 0.083), ('didn', 0.082), ('look', 0.079), ('general', 0.078), ('things', 0.068), ('time', 0.05), ('way', 0.048), ('get', 0.043), ('see', 0.04), ('one', 0.03)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 852 andrew gelman stats-2011-08-13-Checking your model using fake data


2 0.15201712 328 andrew gelman stats-2010-10-08-Displaying a fitted multilevel model

Introduction: Elissa Brown writes: I’m working on some data using a multinomial model (3 categories for the response & 2 predictors-1 continuous and 1 binary), and I’ve been looking and looking for some sort of nice graphical way to show my model at work. Something like a predicted probabilities plot. I know you can do this for the levels of Y with just one covariate, but is this still a valid way to describe the multinomial model (just doing a pred plot for each covariate)? What’s the deal, is there really no way to graphically represent a successful multinomial model? Also, is it unreasonable to break down your model into a binary response just to get some ROC curves? This seems like cheating. From what I’ve found so far, it seems that people just avoid graphical support when discussing their fitted multinomial models. My reply: It’s hard for me to think about this sort of thing in the abstract with no context. We do have one example in chapter 6 of ARM where we display data and fitted m

3 0.1434236 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

Introduction: David Karger writes: Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major dis-incentive to sharing. I would love to hear more, as this question of effort is one we specifically target in our development of tools for data authoring and publishing. As a straw man, let me point out that sharing data technically requires no more than posting an excel spreadsheet online. And that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives? Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appear

4 0.14040753 1445 andrew gelman stats-2012-08-06-Slow progress

Introduction: I received the following message: I am a Psychology postgraduate at the University of Glasgow and am writing for an article request. I’ve just read your 2008 published article titled “A weakly informative default prior distribution for logistic and other regression models” and found from it that your group also wrote a report on applying the Bayesian logistic regression approach to multilevel model, which is titled “An approximate EM algorithm for multilevel generalized linear models”. I have been looking for it online but did not find it, and was wondering if I may request this report from you? My first thought is that this is a good sign that psychology undergraduates are reading papers like this. Unfortunately I had to reply as follows: Hi, we actually programmed this up but never debugged it! So no actual paper . . . I think I could’ve done it if I had ever focused on the problem. Between the messiness of the algebra and the messiness of the R code, I never got it all to

5 0.13961866 1735 andrew gelman stats-2013-02-24-F-f-f-fake data

Introduction: Tiago Fragoso writes: Suppose I fit a two stage regression model Y = a + bx + e a = cw + d + e1 I could fit it all in one step by using MCMC for example (my model is more complicated than that, so I’ll have to do it by MCMC). However, I could fit the first regression only using MCMC because those estimates are hard to obtain and perform the second regression using least squares or a separate MCMC. So there’s an ‘one step’ inference based on doing it all at the same time and a ‘two step’ inference by fitting one and using the estimates on the further steps. What is gained or lost between both? Is anything done in this question? My response: Rather than answering your particular question, I’ll give you my generic answer, which is to simulate fake data from your model, then fit your model both ways and see how the results differ. Repeat the simulation a few thousand times and you can make all the statistical comparisons you like.
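The "fit it both ways on fake data" answer can be sketched directly. The toy model below is my own illustration (not Fragoso's actual model): a regression with two predictors, estimated jointly in one step versus a naive two-step procedure that fits one predictor first and regresses the residuals on the other. Repeating the simulation many times shows what is gained or lost: here the naive two-step estimate is biased when the predictors are correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
b_true, c_true, d_true = 1.5, 0.8, 0.3  # assumed true coefficients

def one_rep(n=200):
    w = rng.normal(size=n)
    x = 0.5 * w + rng.normal(size=n)  # x correlated with w
    y = d_true + c_true * w + b_true * x + rng.normal(size=n)

    # One-step: joint least squares on [1, w, x]; read off the slope on x
    X = np.column_stack([np.ones(n), w, x])
    b_joint = np.linalg.lstsq(X, y, rcond=None)[0][2]

    # Naive two-step: fit w first, then regress the residuals on raw x
    Xw = np.column_stack([np.ones(n), w])
    resid = y - Xw @ np.linalg.lstsq(Xw, y, rcond=None)[0]
    b_two = np.linalg.lstsq(x[:, None], resid, rcond=None)[0][0]
    return b_joint, b_two

# Repeat the fake-data simulation many times and compare the two procedures
reps = np.array([one_rep() for _ in range(2000)])
print("one-step mean/sd:", reps[:, 0].mean(), reps[:, 0].std())
print("two-step mean/sd:", reps[:, 1].mean(), reps[:, 1].std())
```

With this setup the one-step estimate centers on the true slope, while the naive two-step estimate is attenuated because the first stage absorbs part of x's effect through w; the repeated simulation makes that comparison visible without any analytic derivation.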

6 0.13892429 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

7 0.13259037 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

8 0.13214657 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?

9 0.12747532 154 andrew gelman stats-2010-07-18-Predictive checks for hierarchical models

10 0.11971414 41 andrew gelman stats-2010-05-19-Updated R code and data for ARM

11 0.11887547 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

12 0.10637002 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from

13 0.10563438 1141 andrew gelman stats-2012-01-28-Using predator-prey models on the Canadian lynx series

14 0.10461496 146 andrew gelman stats-2010-07-14-The statistics and the science

15 0.10297011 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents

16 0.10110158 213 andrew gelman stats-2010-08-17-Matching at two levels

17 0.1009469 976 andrew gelman stats-2011-10-27-Geophysicist Discovers Modeling Error (in Economics)

18 0.098452047 2236 andrew gelman stats-2014-03-07-Selection bias in the reporting of shaky research

19 0.0973837 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models

20 0.096583739 244 andrew gelman stats-2010-08-30-Useful models, model checking, and external validation: a mini-discussion


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.17), (1, 0.081), (2, 0.013), (3, 0.038), (4, 0.112), (5, 0.007), (6, 0.03), (7, -0.079), (8, 0.1), (9, 0.016), (10, 0.058), (11, 0.054), (12, -0.007), (13, -0.021), (14, -0.049), (15, 0.042), (16, 0.034), (17, -0.067), (18, -0.007), (19, 0.026), (20, 0.03), (21, -0.021), (22, 0.01), (23, -0.084), (24, -0.051), (25, 0.004), (26, 0.024), (27, -0.113), (28, -0.001), (29, -0.013), (30, -0.003), (31, 0.012), (32, -0.033), (33, 0.059), (34, 0.003), (35, -0.031), (36, -0.044), (37, 0.052), (38, -0.045), (39, 0.014), (40, 0.027), (41, 0.017), (42, -0.011), (43, -0.016), (44, 0.044), (45, 0.012), (46, -0.016), (47, -0.021), (48, 0.039), (49, -0.003)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96438128 852 andrew gelman stats-2011-08-13-Checking your model using fake data


2 0.84894699 1735 andrew gelman stats-2013-02-24-F-f-f-fake data

Introduction: Tiago Fragoso writes: Suppose I fit a two stage regression model Y = a + bx + e a = cw + d + e1 I could fit it all in one step by using MCMC for example (my model is more complicated than that, so I’ll have to do it by MCMC). However, I could fit the first regression only using MCMC because those estimates are hard to obtain and perform the second regression using least squares or a separate MCMC. So there’s an ‘one step’ inference based on doing it all at the same time and a ‘two step’ inference by fitting one and using the estimates on the further steps. What is gained or lost between both? Is anything done in this question? My response: Rather than answering your particular question, I’ll give you my generic answer, which is to simulate fake data from your model, then fit your model both ways and see how the results differ. Repeat the simulation a few thousand times and you can make all the statistical comparisons you like.

3 0.80740601 328 andrew gelman stats-2010-10-08-Displaying a fitted multilevel model

Introduction: Elissa Brown writes: I’m working on some data using a multinomial model (3 categories for the response & 2 predictors-1 continuous and 1 binary), and I’ve been looking and looking for some sort of nice graphical way to show my model at work. Something like a predicted probabilities plot. I know you can do this for the levels of Y with just one covariate, but is this still a valid way to describe the multinomial model (just doing a pred plot for each covariate)? What’s the deal, is there really no way to graphically represent a successful multinomial model? Also, is it unreasonable to break down your model into a binary response just to get some ROC curves? This seems like cheating. From what I’ve found so far, it seems that people just avoid graphical support when discussing their fitted multinomial models. My reply: It’s hard for me to think about this sort of thing in the abstract with no context. We do have one example in chapter 6 of ARM where we display data and fitted m

4 0.80447918 1875 andrew gelman stats-2013-05-28-Simplify until your fake-data check works, then add complications until you can figure out where the problem is coming from

Introduction: I received the following email: I am trying to develop a Bayesian model to represent the process through which individual consumers make online product rating decisions. In my model each individual faces total J product options and for each product option (j) each individual (i) needs to make three sequential decisions: - First he decides whether to consume a specific product option (j) or not (choice decision) - If he decides to consume a product option j, then after consumption he decides whether to rate it or not (incidence decision) - If he decides to rate product j then what finally he decides what rating (k) to assign to it (evaluation decision) We model this decision sequence in terms of three equations. A binary response variable in the first equation represents the choice decision. Another binary response variable in the second equation represents the incidence decision that is observable only when first selection decision is 1. Finally, an ordered response v

5 0.79490149 782 andrew gelman stats-2011-06-29-Putting together multinomial discrete regressions by combining simple logits

Introduction: When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as invlogit (0.01 + 0.98*X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation. What if there are more than 2 categories? If they’re ordered (1, 2, 3, etc), we can do ordered logit (and use bayespolr() to avoid separation). If the categories are unordered (vanilla, chocolate, strawberry), there are unordered multinomial logit and probit models out there. But it’s not so easy to fit these multinomial model in a multilevel setting (with coefficients that vary by group), especially if the computation is embedded in an iterative routine such as mi where you have real time constraints at each step. So this got me wondering whether we could kluge it with logits. Here’s the basic idea (in the ordered and unordered forms): - If you have a variable that goes 1, 2, 3, etc., set up a series of logits: 1 vs. 2,3,…; 2 vs. 3,…; and so forth

6 0.78336209 861 andrew gelman stats-2011-08-19-Will Stan work well with 40×40 matrices?

7 0.75242013 39 andrew gelman stats-2010-05-18-The 1.6 rule

8 0.7490108 2110 andrew gelman stats-2013-11-22-A Bayesian model for an increasing function, in Stan!

9 0.74223214 1886 andrew gelman stats-2013-06-07-Robust logistic regression

10 0.74130392 375 andrew gelman stats-2010-10-28-Matching for preprocessing data for causal inference

11 0.74008667 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

12 0.73400855 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

13 0.72458833 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

14 0.72177255 823 andrew gelman stats-2011-07-26-Including interactions or not

15 0.7147193 1141 andrew gelman stats-2012-01-28-Using predator-prey models on the Canadian lynx series

16 0.70597023 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

17 0.7054435 2018 andrew gelman stats-2013-09-12-Do you ever have that I-just-fit-a-model feeling?

18 0.70516962 1395 andrew gelman stats-2012-06-27-Cross-validation (What is it good for?)

19 0.70437294 773 andrew gelman stats-2011-06-18-Should we always be using the t and robit instead of the normal and logit?

20 0.6987834 1460 andrew gelman stats-2012-08-16-“Real data can be a pain”


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(16, 0.033), (21, 0.026), (24, 0.199), (28, 0.024), (30, 0.094), (34, 0.032), (38, 0.032), (55, 0.035), (85, 0.026), (86, 0.018), (99, 0.363)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98529375 852 andrew gelman stats-2011-08-13-Checking your model using fake data


2 0.9763478 1176 andrew gelman stats-2012-02-19-Standardized writing styles and standardized graphing styles

Introduction: Back in the 1700s—JennyD can correct me if I’m wrong here—there was no standard style for writing. You could be discursive, you could be descriptive, flowery, or terse. Direct or indirect, serious or funny. You could construct a novel out of letters or write a philosophical treatise in the form of a novel. Nowadays there are rules. You can break the rules, but then you’re Breaking. The. Rules. Which is a distinctive choice all its own. Consider academic writing. Serious works of economics or statistics tend to be written in a serious style in some version of plain academic English. The few exceptions (for example, by Tukey, Tufte, Mandelbrot, and Jaynes) are clearly exceptions, written in styles that are much celebrated but not so commonly followed. A serious work of statistics, or economics, or political science could be written in a highly unconventional form (consider, for example, Wallace Shawn’s plays), but academic writers in these fields tend to stick with the sta

3 0.97618568 1510 andrew gelman stats-2012-09-25-Incoherence of Bayesian data analysis

Introduction: Hogg writes: At the end this article you wonder about consistency. Have you ever considered the possibility that utility might resolve some of the problems? I have no idea if it would—I am not advocating that position—I just get some kind of intuition from phrases like “Judgment is required to decide…”. Perhaps there is a coherent and objective description of what is—or could be—done under a coherent “utility” model (like a utility that could be objectively agreed upon and computed). Utilities are usually subjective—true—but priors are usually subjective too. My reply: I’m happy to think about utility, for some particular problem or class of problems going to the effort of assigning costs and benefits to different outcomes. I agree that a utility analysis, even if (necessarily) imperfect, can usefully focus discussion. For example, if a statistical method for selecting variables is justified on the basis of cost, I like the idea of attempting to quantify the costs of ga

4 0.97541106 706 andrew gelman stats-2011-05-11-The happiness gene: My bottom line (for now)

Introduction: I had a couple of email exchanges with Jan-Emmanuel De Neve and James Fowler, two of the authors of the article on the gene that is associated with life satisfaction which we blogged the other day. (Bruno Frey, the third author of the article in question, is out of town according to his email.) Fowler also commented directly on the blog. I won’t go through all the details, but now I have a better sense of what’s going on. (Thanks, Jan and James!) Here’s my current understanding: 1. The original manuscript was divided into two parts: an article by De Neve alone published in the Journal of Human Genetics, and an article by De Neve, Fowler, Frey, and Nicholas Christakis submitted to Econometrica. The latter paper repeats the analysis from the Adolescent Health survey and also replicates with data from the Framingham heart study (hence Christakis’s involvement). The Framingham study measures a slightly different gene and uses a slightly different life-satisfaction question com

5 0.97470874 970 andrew gelman stats-2011-10-24-Bell Labs

Introduction: Sining Chen told me they’re hiring in the statistics group at Bell Labs . I’ll do my bit for economic stimulus by announcing this job (see below). I love Bell Labs. I worked there for three summers, in a physics lab in 1985-86 under the supervision of Loren Pfeiffer, and by myself in the statistics group in 1990. I learned a lot working for Loren. He was a really smart and driven guy. His lab was a small set of rooms—in Bell Labs, everything’s in a small room, as they value the positive externality of close physical proximity of different labs, which you get by making each lab compact—and it was Loren, his assistant (a guy named Ken West who kept everything running in the lab), and three summer students: me, Gowton Achaibar, and a girl whose name I’ve forgotten. Gowtan and I had a lot of fun chatting in the lab. One day I made a silly comment about Gowton’s accent—he was from Guyana and pronounced “three” as “tree”—and then I apologized and said: Hey, here I am making fun o

6 0.97444367 2090 andrew gelman stats-2013-11-05-How much do we trust a new claim that early childhood stimulation raised earnings by 42%?

7 0.9743377 1605 andrew gelman stats-2012-12-04-Write This Book

8 0.97422421 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go

9 0.97408068 1520 andrew gelman stats-2012-10-03-Advice that’s so eminently sensible but so difficult to follow

10 0.97376251 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

11 0.97373885 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

12 0.97371936 2174 andrew gelman stats-2014-01-17-How to think about the statistical evidence when the statistical evidence can’t be conclusive?

13 0.97360402 2129 andrew gelman stats-2013-12-10-Cross-validation and Bayesian estimation of tuning parameters

14 0.9734723 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update

15 0.97313213 2223 andrew gelman stats-2014-02-24-“Edlin’s rule” for routinely scaling down published estimates

16 0.97309846 1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals

17 0.97303712 788 andrew gelman stats-2011-07-06-Early stopping and penalized likelihood

18 0.9728809 86 andrew gelman stats-2010-06-14-“Too much data”?

19 0.97267109 594 andrew gelman stats-2011-02-28-Behavioral economics doesn’t seem to have much to say about marriage

20 0.97266978 262 andrew gelman stats-2010-09-08-Here’s how rumors get started: Lineplots, dotplots, and nonfunctional modernist architecture