knowledge-graph by maker-knowledge-mining

1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science


meta info for this blog

Source: html

Introduction: Dan Kahan writes: Okay, have done due diligence here & can’t find the reference. It was in recent blog — and was more or less an aside — but you ripped into researchers (pretty sure econometricians, but this could be my memory adding to your account recollections it conjured from my own experience) who purport to make estimates or predictions based on multivariate regression in which the value of particular predictor is set at some level while others “held constant” etc., on ground that variance in that particular predictor independent of covariance in other model predictors is unrealistic. You made it sound, too, as if this were one of the pet peeves in your menagerie — leading me to think you had blasted into it before. Know what I’m talking about? Also — isn’t this really just a way of saying that the model is misspecified — at least if the goal is to try to make a valid & unbiased estimate of the impact of that particular predictor? The problem can’t be that one is usin
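A minimal sketch of the practice being criticized, using a made-up data-generating process (this illustration is mine, not anything from the post): when predictors are strongly collinear, a prediction that moves one predictor while "holding the others constant" describes combinations of values that essentially never occur in the data.

```python
# Toy example: "effect of x1 with x2 held at its mean" under severe collinearity.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # x2 tracks x1 almost perfectly
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Model-based "effect" of moving x1 from -2 to +2 while x2 is frozen at its mean:
x_lo = np.array([1.0, -2.0, x2.mean()])
x_hi = np.array([1.0, +2.0, x2.mean()])
print("predicted change:", x_hi @ beta - x_lo @ beta)

# But the data contain essentially no cases with x1 near -2 and x2 near its mean:
print("observed pairs near (x1=-2, x2=0):",
      np.sum((np.abs(x1 + 2) < 0.2) & (np.abs(x2) < 0.2)))
```

The regression itself is fine; the problem is that the counterfactual it is being asked to describe lies far outside the region of the data.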


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Dan Kahan writes: Okay, have done due diligence here & can’t find the reference. [sent-1, score-0.122]

2 , on ground that variance in that particular predictor independent of covariance in other model predictors is unrealistic. [sent-3, score-1.138]

3 You made it sound, too, as if this were one of the pet peeves in your menagerie — leading me to think you had blasted into it before. [sent-4, score-0.273]

4 Also — isn’t this really just a way of saying that the model is misspecified — at least if the goal is to try to make a valid & unbiased estimate of the impact of that particular predictor? [sent-6, score-0.766]

5 In any case, I’m reminded of the advice I often give that each causal inference typically requires its own analysis. [sent-13, score-0.176]

6 I’m generally skeptical of an analysis where someone picks out one coefficient to address Hypothesis 1, another to address Hypothesis 2a, and so forth. [sent-14, score-0.428]

7 If a causal inference can be framed via a natural experiment on some variable x, you want to control for things that come before x and not what comes after (I’m thinking of logical order, which is related to but is not identical to time order). [sent-15, score-0.479]
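A rough illustration of that last point with a toy simulation (mine, not from the post): adjusting for a variable that comes after x removes part of x's effect.

```python
# Conditioning on a post-x variable m biases the estimate of x's total effect.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)                        # the variable of interest
m = 0.8 * x + rng.normal(size=n)              # comes after x (a consequence of x)
y = 1.0 * x + 1.0 * m + rng.normal(size=n)    # total effect of x on y is 1.8

def ols(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("y ~ x     :", ols(y, x)[1])     # ~1.8, the total effect
print("y ~ x + m :", ols(y, x, m)[1])  # ~1.0, the post-x adjustment eats the rest
```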


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('predictor', 0.421), ('covariate', 0.279), ('collinear', 0.245), ('impact', 0.234), ('variance', 0.166), ('covariance', 0.162), ('estimate', 0.152), ('constant', 0.149), ('address', 0.127), ('diligence', 0.122), ('posits', 0.122), ('purport', 0.122), ('independent', 0.118), ('regression', 0.118), ('invalid', 0.115), ('peeves', 0.11), ('try', 0.108), ('ripped', 0.107), ('picks', 0.101), ('omitted', 0.101), ('hypothesis', 0.099), ('okay', 0.098), ('causal', 0.098), ('model', 0.097), ('affairs', 0.096), ('particular', 0.096), ('commonplace', 0.095), ('order', 0.095), ('influences', 0.091), ('econometricians', 0.091), ('pet', 0.09), ('interest', 0.084), ('framed', 0.083), ('holding', 0.083), ('memory', 0.08), ('kahan', 0.08), ('combine', 0.08), ('unbiased', 0.079), ('ground', 0.078), ('inference', 0.078), ('thinking', 0.074), ('phenomenon', 0.073), ('identical', 0.073), ('index', 0.073), ('logical', 0.073), ('yield', 0.073), ('one', 0.073), ('multivariate', 0.072), ('biased', 0.072), ('member', 0.071)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

Introduction: Dan Kahan writes: Okay, have done due diligence here & can’t find the reference. It was in recent blog — and was more or less an aside — but you ripped into researchers (pretty sure econometricians, but this could be my memory adding to your account recollections it conjured from my own experience) who purport to make estimates or predictions based on multivariate regression in which the value of particular predictor is set at some level while others “held constant” etc., on ground that variance in that particular predictor independent of covariance in other model predictors is unrealistic. You made it sound, too, as if this were one of the pet peeves in your menagerie — leading me to think you had blasted into it before. Know what I’m talking about? Also — isn’t this really just a way of saying that the model is misspecified — at least if the goal is to try to make a valid & unbiased estimate of the impact of that particular predictor? The problem can’t be that one is usin

2 0.21333627 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

Introduction: 26. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true? (Indicate all that apply.) (a) If a question is answered correctly by students with very low and very high ability, but is missed by students in the middle, it will have a high value for its discrimination parameter. (b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset. (c) To keep the model identified, you can set one of the difficulty parameters or one of the ability parameters to zero and set one of the discrimination parameters to 1. (d) If two students answer the same number of q
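The question refers to a two-parameter logistic (2PL) item-response model; here is a minimal sketch of that curve with made-up parameter values (my illustration, not part of the exam).

```python
# 2PL item-response curve: P(correct) as a function of student ability.
import numpy as np

def p_correct(ability, difficulty, discrimination):
    """2PL model: probability that a student answers an item correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

# For positive discrimination the curve is monotone in ability:
# higher-ability students are never less likely to get the item right.
abilities = np.array([-2.0, 0.0, 2.0])
print(p_correct(abilities, difficulty=0.0, discrimination=1.5))
```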

3 0.20137441 1365 andrew gelman stats-2012-06-04-Question 25 of my final exam for Design and Analysis of Sample Surveys

Introduction: 25. You are applying multilevel regression and poststratification (MRP) to a survey of 1500 people to estimate support for the space program, by state. The model is fit using, as a state-level predictor, the Republican presidential vote in the state, which turns out to have a low correlation with support for the space program. Which of the following statements are basically true? (Indicate all that apply.) (a) For small states, the MRP estimates will be determined almost entirely by the demographic characteristics of the respondents in the sample from that state. (b) For small states, the MRP estimates will be determined almost entirely by the demographic characteristics of the population in that state. (c) Adding a predictor specifically for this model (for example, a measure of per-capita space-program spending in the state) could dramatically improve the estimates of state-level opinion. (d) It would not be appropriate to add a predictor such as per-capita space-program spen
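For readers unfamiliar with MRP, the poststratification step it refers to is just a population-weighted average of model-based cell estimates; the numbers below are hypothetical, not from the exam.

```python
# Poststratification: average cell-level estimates using population counts as weights.
import numpy as np

# hypothetical model-based estimates of support, by demographic cell
cell_support = np.array([0.55, 0.40, 0.62, 0.48])
# population counts of those cells in one state (e.g., from the census)
cell_counts = np.array([120_000, 300_000, 80_000, 200_000])

state_estimate = np.average(cell_support, weights=cell_counts)
print(round(state_estimate, 3))
```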

4 0.17846112 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

Introduction: David Hoaglin writes: After seeing it cited, I just read your paper in Technometrics. The home radon levels provide an interesting and instructive example. I [Hoaglin] have a different take on the difficulty of interpreting the estimated coefficient of the county-level basement proportion (gamma-sub-2) on page 434. An important part of the difficulty involves “other things being equal.” That sounds like the widespread interpretation of a regression coefficient as telling how the dependent variable responds to change in that predictor when the other predictors are held constant. Unfortunately, as a general interpretation, that language is oversimplified; it doesn’t reflect how regression actually works. The appropriate general interpretation is that the coefficient tells how the dependent variable responds to change in that predictor after allowing for simultaneous change in the other predictors in the data at hand. Thus, in the county-level regression gamma-sub-2 summarize
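Hoaglin's "after allowing for simultaneous change" reading can be made concrete with the standard result that a multiple-regression coefficient equals the slope of y on the part of that predictor left over after regressing it on the other predictors. A small numpy sketch with simulated data (mine, not Hoaglin's):

```python
# The coefficient on x1 equals the slope of y on x1 residualized against x2.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
full = fit(np.column_stack([ones, x1, x2]), y)

# residualize x1 on the other predictors (here the intercept and x2)
x1_resid = x1 - np.column_stack([ones, x2]) @ fit(np.column_stack([ones, x2]), x1)
partial = fit(np.column_stack([ones, x1_resid]), y)

print(full[1], partial[1])   # the two slopes agree
```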

5 0.1577712 960 andrew gelman stats-2011-10-15-The bias-variance tradeoff

Introduction: Joshua Vogelstein asks for my thoughts as a Bayesian on the above topic. So here they are (briefly): The concept of the bias-variance tradeoff can be useful if you don’t take it too seriously. The basic idea is as follows: if you’re estimating something, you can slice your data finer and finer, or perform more and more adjustments, each time getting a purer—and less biased—estimate. But each subdivision or each adjustment reduces your sample size or increases potential estimation error, hence the variance of your estimate goes up. That story is real. In lots and lots of examples, there’s a continuum between a completely unadjusted general estimate (high bias, low variance) and a specific, focused, adjusted estimate (low bias, high variance). Suppose, for example, you’re using data from a large experiment to estimate the effect of a treatment on a fairly narrow group, say, white men between the ages of 45 and 50. At one extreme, you could just take the estimated treatment e
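A toy simulation of the tradeoff described above (numbers invented for illustration): estimating the subgroup effect from the subgroup alone is unbiased but noisy, while the pooled estimate is stable but biased.

```python
# Bias-variance tradeoff: subgroup-only estimate vs. pooled estimate of a subgroup effect.
import numpy as np

rng = np.random.default_rng(3)
true_subgroup, true_overall = 0.5, 0.3        # subgroup effect differs from the average
n_sub, n_all, sigma = 50, 5000, 2.0

sub_est, all_est = [], []
for _ in range(2000):
    sub_est.append(rng.normal(true_subgroup, sigma / np.sqrt(n_sub)))   # unbiased, noisy
    all_est.append(rng.normal(true_overall, sigma / np.sqrt(n_all)))    # biased, stable

def mse(est, truth=true_subgroup):
    e = np.asarray(est)
    return np.mean((e - truth) ** 2)

print("subgroup-only MSE:", round(mse(sub_est), 4))
print("pooled MSE       :", round(mse(all_est), 4))
```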

6 0.15221897 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

7 0.14623556 936 andrew gelman stats-2011-10-02-Covariate Adjustment in RCT - Model Overfitting in Multilevel Regression

8 0.13259037 852 andrew gelman stats-2011-08-13-Checking your model using fake data

9 0.12238123 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

10 0.11925094 228 andrew gelman stats-2010-08-24-A new efficient lossless compression algorithm

11 0.11779374 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)

12 0.11760513 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

13 0.11207137 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

14 0.11116601 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

15 0.10862479 25 andrew gelman stats-2010-05-10-Two great tastes that taste great together

16 0.10750476 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

17 0.1074052 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

18 0.10736363 303 andrew gelman stats-2010-09-28-“Genomics” vs. genetics

19 0.10733156 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

20 0.10617535 328 andrew gelman stats-2010-10-08-Displaying a fitted multilevel model


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.189), (1, 0.101), (2, 0.085), (3, -0.045), (4, 0.053), (5, 0.016), (6, 0.02), (7, -0.037), (8, 0.076), (9, 0.069), (10, 0.019), (11, 0.038), (12, 0.025), (13, 0.003), (14, 0.001), (15, 0.014), (16, -0.024), (17, -0.013), (18, -0.025), (19, 0.028), (20, -0.008), (21, -0.022), (22, 0.055), (23, 0.017), (24, 0.048), (25, 0.023), (26, 0.04), (27, -0.039), (28, -0.04), (29, -0.008), (30, 0.074), (31, -0.053), (32, 0.009), (33, -0.032), (34, -0.018), (35, -0.046), (36, 0.055), (37, -0.017), (38, 0.047), (39, -0.009), (40, -0.042), (41, -0.055), (42, -0.07), (43, 0.032), (44, 0.027), (45, 0.044), (46, -0.012), (47, 0.044), (48, 0.03), (49, -0.009)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96374261 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

Introduction: Dan Kahan writes: Okay, have done due diligence here & can’t find the reference. It was in recent blog — and was more or less an aside — but you ripped into researchers (pretty sure econometricians, but this could be my memory adding to your account recollections it conjured from my own experience) who purport to make estimates or predictions based on multivariate regression in which the value of particular predictor is set at some level while others “held constant” etc., on ground that variance in that particular predictor independent of covariance in other model predictors is unrealistic. You made it sound, too, as if this were one of the pet peeves in your menagerie — leading me to think you had blasted into it before. Know what I’m talking about? Also — isn’t this really just a way of saying that the model is misspecified — at least if the goal is to try to make a valid & unbiased estimate of the impact of that particular predictor? The problem can’t be that one is usin

2 0.78884494 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

Introduction: David Hoaglin writes: After seeing it cited, I just read your paper in Technometrics. The home radon levels provide an interesting and instructive example. I [Hoaglin] have a different take on the difficulty of interpreting the estimated coefficient of the county-level basement proportion (gamma-sub-2) on page 434. An important part of the difficulty involves “other things being equal.” That sounds like the widespread interpretation of a regression coefficient as telling how the dependent variable responds to change in that predictor when the other predictors are held constant. Unfortunately, as a general interpretation, that language is oversimplified; it doesn’t reflect how regression actually works. The appropriate general interpretation is that the coefficient tells how the dependent variable responds to change in that predictor after allowing for simultaneous change in the other predictors in the data at hand. Thus, in the county-level regression gamma-sub-2 summarize

3 0.76660246 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

Introduction: When it rains it pours . . . John Transue writes: I saw a post on Andrew Sullivan’s blog today about life expectancy in different US counties. With a bunch of the worst counties being in Mississippi, I thought that it might be another case of analysts getting extreme values from small counties. However, the paper (see here ) includes a pretty interesting methods section. This is from page 5, “Specifically, we used a mixed-effects Poisson regression with time, geospatial, and covariate components. Poisson regression fits count outcome variables, e.g., death counts, and is preferable to a logistic model because the latter is biased when an outcome is rare (occurring in less than 1% of observations).” They have downloadable data. I believe that the data are predicted values from the model. A web appendix also gives 90% CIs for their estimates. Do you think they solved the small county problem and that the worst counties really are where their spreadsheet suggests? My re
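The "extreme values from small counties" concern can be seen in a few lines of simulation (my own toy example, not the paper's data): with the same true rate everywhere, the smallest counties produce the most extreme observed rates.

```python
# Small counties generate the widest range of raw rates even when nothing varies.
import numpy as np

rng = np.random.default_rng(4)
true_rate = 0.01                                    # identical in every county
pops = np.array([500, 1_000, 5_000, 50_000, 500_000])
deaths = rng.poisson(true_rate * pops, size=(200, len(pops)))
rates = deaths / pops

for p, col in zip(pops, rates.T):
    print(f"pop {p:>7}: observed rate ranges {col.min():.4f} to {col.max():.4f}")
```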

4 0.73613065 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

Introduction: Andy Cooper writes: A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book wit
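A quick simulation of the point at issue (mine, not from the linked article): ordinary least squares recovers the slope just fine when the predictor is heavily skewed, so normality of the independent variables is not the relevant assumption.

```python
# OLS slope estimates with a strongly non-normal (exponential) predictor.
import numpy as np

rng = np.random.default_rng(5)
slopes = []
for _ in range(1000):
    x = rng.exponential(size=200)            # heavily skewed predictor
    y = 1.0 + 2.0 * x + rng.normal(size=200)
    X = np.column_stack([np.ones_like(x), x])
    slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

print(np.mean(slopes), np.std(slopes))       # centered on the true value 2.0
```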

5 0.70790631 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

Introduction: Andreas Graefe writes (see here, here, and here): The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. The resulting “optimally” weighted linear composite is then used when predicting new data. This approach is useful in situations with large and reliable datasets and few predictor variables. However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. In such situations, including all relevant variables is more important than their weighting. These findings have yet to impact many fields. This study uses data from nine established U.S. election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value o
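A rough demonstration in the spirit of the result Graefe describes, using a made-up simulation (not his election data): with few noisy cases and many similarly relevant predictors, an equal-weight composite predicts new data about as well as least-squares weights.

```python
# Equal ("unit") weights vs. fitted OLS weights, compared on fresh data.
import numpy as np

rng = np.random.default_rng(6)
p, n_train, n_test, reps = 10, 25, 2000, 200
corr_ols, corr_unit = [], []

for _ in range(reps):
    w = rng.uniform(0.5, 1.5, size=p)              # all predictors matter similarly

    def draw(n):
        X = rng.normal(size=(n, p))
        y = X @ w + 3.0 * rng.normal(size=n)       # noisy criterion
        return X, y

    Xtr, ytr = draw(n_train)
    Xte, yte = draw(n_test)

    beta = np.linalg.lstsq(np.column_stack([np.ones(n_train), Xtr]), ytr, rcond=None)[0]
    pred_ols = np.column_stack([np.ones(n_test), Xte]) @ beta
    pred_unit = Xte.sum(axis=1)                    # equal weights, no fitting at all

    corr_ols.append(np.corrcoef(pred_ols, yte)[0, 1])
    corr_unit.append(np.corrcoef(pred_unit, yte)[0, 1])

print("mean out-of-sample r, OLS weights :", round(np.mean(corr_ols), 3))
print("mean out-of-sample r, unit weights:", round(np.mean(corr_unit), 3))
```

The comparison in correlation terms follows the usual framing of this literature; with more data or more unequal true weights, the fitted weights pull ahead.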

6 0.70784354 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

7 0.70241785 393 andrew gelman stats-2010-11-04-Estimating the effect of A on B, and also the effect of B on A

8 0.70237345 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

9 0.69155931 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

10 0.68846262 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

11 0.68468022 550 andrew gelman stats-2011-02-02-An IV won’t save your life if the line is tangled

12 0.68053108 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

13 0.68038827 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

14 0.67759955 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

15 0.67631084 796 andrew gelman stats-2011-07-10-Matching and regression: two great tastes etc etc

16 0.67167634 301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.

17 0.66989112 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering

18 0.66837054 368 andrew gelman stats-2010-10-25-Is instrumental variables analysis particularly susceptible to Type M errors?

19 0.66480041 1441 andrew gelman stats-2012-08-02-“Based on my experiences, I think you could make general progress by constructing a solution to your specific problem.”

20 0.66148841 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.223), (15, 0.03), (16, 0.055), (21, 0.035), (22, 0.014), (24, 0.163), (45, 0.016), (56, 0.011), (82, 0.021), (86, 0.028), (89, 0.038), (99, 0.27)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97720867 549 andrew gelman stats-2011-02-01-“Roughly 90% of the increase in . . .” Hey, wait a minute!

Introduction: Matthew Yglesias links approvingly to the following statement by Michael Mandel: Homeland Security accounts for roughly 90% of the increase in federal regulatory employment over the past ten years. Roughly 90%, huh? That sounds pretty impressive. But wait a minute . . . what if total federal regulatory employment had increased a bit less. Then Homeland Security could’ve accounted for 105% of the increase, or 500% of the increase, or whatever. The point is the change in total employment is the sum of a bunch of pluses and minuses. It happens that, if you don’t count Homeland Security, the total hasn’t changed much–I’m assuming Mandel’s numbers are correct here–and that could be interesting. The “roughly 90%” figure is misleading because, when written as a percent of the total increase, it’s natural to quickly envision it as a percentage that is bounded by 100%. There is a total increase in regulatory employment that the individual agencies sum to, but some margins are p
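A tiny made-up numeric example of why the "90% of the increase" framing misleads: when the total change is a sum of pluses and minuses, one component's share of the net increase can exceed 100%.

```python
# Hypothetical numbers only: a component's share of a net increase is not bounded by 100%.
homeland_security_change = +9_000
all_other_agencies_change = -4_000          # hypothetical decline elsewhere
total_change = homeland_security_change + all_other_agencies_change

print(homeland_security_change / total_change)   # 1.8, i.e., "180% of the increase"
```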

2 0.97482121 17 andrew gelman stats-2010-05-05-Taking philosophical arguments literally

Introduction: Aaron Swartz writes the following, as a lead-in to an argument in favor of vegetarianism: Imagine you were an early settler of what is now the United States. It seems likely you would have killed native Americans. After all, your parents killed them, your siblings killed them, your friends killed them, the leaders of the community killed them, the President killed them. Chances are, you would have killed them too . . . Or if you see nothing wrong with killing native Americans, take the example of slavery. Again, everyone had slaves and probably didn’t think too much about the morality of it. . . . Are these statements true, though? It’s hard for me to believe that most early settlers (from the context, it looks like Swartz is discussing the 1500s-1700s here) killed native Americans. That is, if N is the number of early settlers, and Y is the number of these settlers who killed at least one Indian, I suspect Y/N is much closer to 0 than to 1. Similarly, it’s not even cl

3 0.97294515 1189 andrew gelman stats-2012-02-28-Those darn physicists

Introduction: X pointed me to this atrocity: The data on obesity are pretty unequivocal: we’re fat, and we’re getting fatter. Explanations for this trend, however, vary widely, with the blame alternately pinned on individual behaviour, genetics and the environment. In other words, it’s a race between “we eat too much”, “we’re born that way” and “it’s society’s fault”. Now, research by Lazaros Gallos has come down strongly in favour of the third option. Gallos and his colleagues at City College of New York treated the obesity rates in some 3000 US counties as “particles” in a physical system, and calculated the correlation between pairs of “particles” as a function of the distance between them. . . . the data indicated that the size of the “obesity cities” – geographic regions with correlated obesity rates – was huge, up to 1000 km. . . . Just to be clear: I have no problem with people calculating spatial autocorrelations (or even with them using quaint terminology such as referring to coun
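The calculation being described is essentially an empirical correlogram: correlate pairs of county values as a function of the distance between the counties. A sketch on fake county data (my illustration, not Gallos et al.'s analysis):

```python
# Empirical correlogram on simulated, spatially smoothed "county" values.
import numpy as np

rng = np.random.default_rng(7)
n = 300
coords = rng.uniform(0, 1000, size=(n, 2))          # fake county locations, in km
values = rng.normal(size=n)                         # fake obesity rates

# smooth the values spatially so nearby counties are similar
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
weights = np.exp(-dist / 100.0)
values = weights @ values / weights.sum(axis=1)

# pairwise correlation by distance bin
iu = np.triu_indices(n, k=1)
pair_d = dist[iu]
prod = (values[iu[0]] - values.mean()) * (values[iu[1]] - values.mean())
for lo in (0, 200, 400, 800):
    mask = (pair_d >= lo) & (pair_d < lo + 200)
    print(f"{lo:>4}-{lo + 200} km: {prod[mask].mean() / values.var():+.2f}")
```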

4 0.97028768 97 andrew gelman stats-2010-06-18-Economic Disparities and Life Satisfaction in European Regions

Introduction: Grazia Pittau, Roberto Zelli, and I came out with a paper investigating the role of economic variables in predicting regional disparities in reported life satisfaction of European Union citizens. We use multilevel modeling to explicitly account for the hierarchical nature of our data, respondents within regions and countries, and for understanding patterns of variation within and between regions. Here’s what we found: - Personal income matters more in poor regions than in rich regions, a pattern that still holds for regions within the same country. - Being unemployed is negatively associated with life satisfaction even after controlling for income variation. Living in high-unemployment regions does not alleviate the unhappiness of being out of work. - After controlling for individual characteristics and modeling interactions, regional differences in life satisfaction still remain. Here’s a quick graph; there’s more in the article:
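A hedged sketch of the kind of varying-intercept model described, with simulated data and hypothetical variable names (not the authors' code):

```python
# Respondents nested within regions: varying intercept by region, individual predictors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_regions, n_per = 30, 50
region = np.repeat(np.arange(n_regions), n_per)
region_effect = rng.normal(0, 0.5, size=n_regions)[region]
log_income = rng.normal(size=n_regions * n_per)
unemployed = rng.binomial(1, 0.1, size=n_regions * n_per)
life_sat = (6.0 + 0.4 * log_income - 0.8 * unemployed
            + region_effect + rng.normal(size=n_regions * n_per))

df = pd.DataFrame(dict(life_sat=life_sat, log_income=log_income,
                       unemployed=unemployed, region=region))
fit = smf.mixedlm("life_sat ~ log_income + unemployed",
                  data=df, groups=df["region"]).fit()
print(fit.params)
```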

5 0.9691931 1017 andrew gelman stats-2011-11-18-Lack of complete overlap

Introduction: Evens Salies writes: I have a question regarding a randomizing constraint in my current funded electricity experiment. After elimination of missing data we have 110 voluntary households from a larger population (resource constraints do not allow us to have more households!). I randomly assign them to treated and non-treated, where the treatment variable is some ICT that allows the treated to track their electricity consumption in real time. The ICT is made of two devices, one that is plugged into the household’s modem and the other on the electric meter. A necessary condition for being treated is that the distance between the box and the meter be below some threshold (d), the value of which is approximately 20 meters. 50 ICTs can be installed. 60 households will be in the control group. But, I can only assign 6 households in the control group for whom d is less than 20. Therefore, I have only 6 households in the control group who have a counterfactual in the group of treated.
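A toy sketch of the overlap problem Salies describes, with made-up distances: once treatment requires d < 20 m and 50 ICTs are installed, few eligible households remain to serve as comparable controls.

```python
# Lack of complete overlap: few control households satisfy the eligibility rule d < 20 m.
import numpy as np

rng = np.random.default_rng(8)
d = rng.uniform(0, 38, size=110)          # hypothetical modem-meter distances, in meters
eligible = np.where(d < 20)[0]

treated = rng.choice(eligible, size=min(50, len(eligible)), replace=False)
control = np.setdiff1d(np.arange(110), treated)
print("eligible households (d < 20):", len(eligible))
print("controls with d < 20 (potential counterfactuals):", int(np.sum(d[control] < 20)))
```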

6 0.9685728 1698 andrew gelman stats-2013-01-30-The spam just gets weirder and weirder

7 0.96839577 1663 andrew gelman stats-2013-01-09-The effects of fiscal consolidation

8 0.96310169 489 andrew gelman stats-2010-12-28-Brow inflation

9 0.94764578 885 andrew gelman stats-2011-09-01-Needed: A Billionaire Candidate for President Who Shares the Views of a Washington Post Columnist

10 0.94315839 1102 andrew gelman stats-2012-01-06-Bayesian Anova found useful in ecology

11 0.94299507 1508 andrew gelman stats-2012-09-23-Speaking frankly

12 0.94069672 1872 andrew gelman stats-2013-05-27-More spam!

13 0.93592435 1893 andrew gelman stats-2013-06-11-Folic acid and autism

14 0.92587405 1954 andrew gelman stats-2013-07-24-Too Good To Be True: The Scientific Mass Production of Spurious Statistical Significance

15 0.92198718 1254 andrew gelman stats-2012-04-09-In the future, everyone will publish everything.

same-blog 16 0.91349864 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

17 0.91228598 1567 andrew gelman stats-2012-11-07-Election reports

18 0.91160798 1171 andrew gelman stats-2012-02-16-“False-positive psychology”

19 0.90601385 663 andrew gelman stats-2011-04-15-Happy tax day!

20 0.89934164 2360 andrew gelman stats-2014-06-05-Identifying pathways for managing multiple disturbances to limit plant invasions