andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1967 knowledge-graph by maker-knowledge-mining

1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?


meta info for this blog

Source: html

Introduction: Andy Cooper writes: A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long-since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book with Jennifer we list the assumptions of the linear regression model . . .
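To make the ordering of these assumptions concrete, here is a minimal simulation sketch in Python/numpy (my own illustration, not code from the post or from the Gelman and Hill book; all names and numbers are made up): the predictor is deliberately non-normal, the fit is fine because the mean function is linear and additive, and the checks that do matter are run on the residuals, not on the predictors.

```python
# Minimal sketch (illustrative, not from the original post): fit a linear
# regression and check the assumptions roughly in decreasing order of
# importance, without ever requiring the predictor itself to be normal.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)             # predictor is deliberately NOT normal
y = 1.5 + 2.0 * x + rng.normal(0, 1, n)    # linear, additive mean; iid errors

# Ordinary least squares via the design matrix [1, x]
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Linearity/additivity: a crude numeric check is that the residuals show no
# trend against the predictor (in practice you would plot resid vs. x).
print("corr(resid, x):", round(float(np.corrcoef(resid, x)[0, 1]), 3))

# Independence, equal variance, and normality concern the ERRORS, so any
# checks are applied to the residuals, never to the predictors.
print("coefficients:", np.round(beta, 2))
print("residual sd :", round(float(resid.std(ddof=2)), 2))
```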


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Andy Cooper writes: A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. [sent-1, score-0.136]

2 And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). [sent-3, score-0.931]

3 I thought we had long-since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) [sent-4, score-1.108]

4 I agree that leverage and influence are important, but normality of the variables? [sent-6, score-0.561]

5 The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. [sent-7, score-0.242]

6 My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. [sent-8, score-0.559]

7 In section 3.6 of my book with Jennifer we list the assumptions of the linear regression model. [sent-10, score-0.565]

8 In decreasing order of importance, these assumptions are: [sent-11, score-0.362]

9 Most importantly, the data you are analyzing should map to the research question you are trying to answer. [sent-13, score-0.157]

10 This sounds obvious but is often overlooked or ignored because it can be inconvenient. [sent-14, score-0.232]

11 The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors. [sent-20, score-0.871]

12 Further assumptions are necessary if a regression coefficient is to be given a causal interpretation. [sent-38, score-0.585]

13 Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points. [sent-41, score-0.507]
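The last point lends itself to a small simulation (again my own numpy sketch, not from the post; the data are made up): with unequal error variances the least-squares coefficients are still estimated accurately, but a naive constant-variance 95% prediction interval for individual points is miscalibrated, over-covering where the noise is small and under-covering where it is large.

```python
# Illustrative sketch: heteroskedastic errors barely hurt the coefficient
# estimates, but they break a constant-variance prediction interval.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0, 10, n)
sigma = 0.5 + 0.4 * x                       # error sd grows with x
y = 1.0 + 2.0 * x + rng.normal(0, sigma)    # unequal variance

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s = resid.std(ddof=2)                       # single pooled residual sd

# Naive +/- 1.96*s prediction band, checked separately for low-x and high-x
inside = np.abs(resid) <= 1.96 * s
print("coefficients (true 1.0, 2.0):", np.round(beta, 3))
print("coverage for x < 5 :", round(float(inside[x < 5].mean()), 3))   # well above 0.95
print("coverage for x >= 5:", round(float(inside[x >= 5].mean()), 3))  # well below 0.95
```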


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('normality', 0.378), ('assumptions', 0.252), ('dated', 0.244), ('assumption', 0.218), ('variables', 0.211), ('normally', 0.191), ('regression', 0.185), ('equal', 0.157), ('cooper', 0.145), ('independent', 0.139), ('rounds', 0.136), ('standardizing', 0.136), ('variance', 0.131), ('overlooked', 0.13), ('linear', 0.128), ('advice', 0.123), ('additivity', 0.119), ('tweet', 0.114), ('leverage', 0.114), ('residuals', 0.112), ('transforming', 0.11), ('decreasing', 0.11), ('andy', 0.105), ('deterministic', 0.103), ('admitting', 0.103), ('ignored', 0.102), ('independence', 0.101), ('popularity', 0.1), ('importantly', 0.1), ('component', 0.094), ('distributed', 0.091), ('opposed', 0.085), ('minor', 0.08), ('map', 0.079), ('weak', 0.079), ('analyzing', 0.078), ('coefficient', 0.078), ('jennifer', 0.077), ('concerns', 0.076), ('moved', 0.073), ('important', 0.073), ('importance', 0.072), ('thought', 0.072), ('predictions', 0.07), ('separate', 0.07), ('given', 0.07), ('influence', 0.069), ('overall', 0.069), ('unless', 0.069), ('bring', 0.069)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?


2 0.30088347 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions

Introduction: Jeff Witmer writes: I noticed that you continue the standard practice in statistics of referring to assumptions; e.g. a blog entry on 2/4/11 at 10:54: “Our method, just like any model, relies on assumptions which we have the duty to state and to check.” I’m in the 6th year of a three-year campaign to get statisticians to drop the word “assumptions” and replace it with “conditions.” The problem, as I see it, is that people tend to think that an assumption is something that one assumes, as in “assuming that we have a right triangle…” or “assuming that k is even…” when constructing a mathematical proof. But in statistics we don’t assume things — unless we have to. Instead, we know that, for example, the validity of a t-test depends on normality, which is a condition that can and should be checked. Let’s not call normality an assumption, lest we imply that it is something that can be assumed. Let’s call it a condition. What do you all think?

3 0.28454214 2046 andrew gelman stats-2013-10-01-I’ll say it again

Introduction: Milan Valasek writes: Psychology students (and probably students in other disciplines) are often taught that in order to perform ‘parametric’ tests, e.g. independent t-test, the data for each group need to be normally distributed. However, in the literature (and various university lecture notes and slides accessible online), I have come across at least 4 different interpretations of what it is that is supposed to be normally distributed when doing a t-test: 1. population 2. sampled data for each group 3. distribution of estimates of means for each group 4. distribution of estimates of the difference between groups I can see how 2 would follow from 1 and 4 from 3 but even then, there are two different sets of interpretations of the normality assumption. Could you please put this issue to rest for me? My quick response is that normality is not so important unless you are focusing on prediction.
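A quick simulation (my own sketch, not from the correspondents) of why the four readings differ in practice: with clearly skewed exponential data, the sampling distribution of the mean (readings 3 and 4) is already close to normal at n = 30, which is why the t-test is reasonably robust for inference about means even when the raw data (readings 1 and 2) would fail any normality check; prediction for individual observations is where the raw-data shape bites.

```python
# Skewed raw data, nearly normal sample means: a CLT illustration in numpy.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 30, 20000
data = rng.exponential(1.0, size=(reps, n))   # strongly skewed population
means = data.mean(axis=1)                     # sampling distribution of the mean

def skewness(a):
    a = a - a.mean()
    return (a**3).mean() / (a**2).mean() ** 1.5

print("skewness of raw data :", round(float(skewness(data.ravel())), 2))  # about 2
print("skewness of the means:", round(float(skewness(means)), 2))         # about 2/sqrt(30), i.e. ~0.37
```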

4 0.2127448 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

Introduction: Fabio Rojas writes: In much of the social sciences outside economics, it’s very common for people to take a regression course or two in graduate school and then stop their statistical education. This creates a situation where you have a large pool of people who have some knowledge, but not a lot of knowledge. As a result, you have a pretty big gap between people like yourself, who are heavily invested in the cutting edge of applied statistics, and other folks. So here is the question: What are the major lessons about good statistical practice that “rank and file” social scientists should know? Sure, most people can recite “Correlation is not causation” or “statistical significance is not substantive significance.” But what are the other big lessons? This question comes from my own experience. I have a math degree and took regression analysis in graduate school, but I definitely do not have the level of knowledge of a statistician. I also do mixed method research, and field wor

5 0.16485244 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

Introduction: Elias Bareinboim asked what I thought about his comment on selection bias in which he referred to a paper by himself and Judea Pearl, “Controlling Selection Bias in Causal Inference.” I replied that I have no problem with what he wrote, but that from my perspective I find it easier to conceptualize such problems in terms of multilevel models. I elaborated on that point in a recent post, “Hierarchical modeling as a framework for extrapolation,” which I think was read by only a few people (I say this because it received only two comments). I don’t think Bareinboim objected to anything I wrote, but like me he is comfortable working within his own framework. He wrote the following to me: In some sense, “not ad hoc” could mean logically consistent. In other words, if one agrees with the assumptions encoded in the model, one must also agree with the conclusions entailed by these assumptions. I am not aware of any other way of doing mathematics. As it turns out, to get causa

6 0.14476655 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

7 0.13986981 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

8 0.13664398 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

9 0.12427953 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

10 0.12276942 2315 andrew gelman stats-2014-05-02-Discovering general multidimensional associations

11 0.12064064 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

12 0.11647294 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

13 0.11416701 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

14 0.11128098 1704 andrew gelman stats-2013-02-03-Heuristics for identifying ecological fallacies?

15 0.11096927 796 andrew gelman stats-2011-07-10-Matching and regression: two great tastes etc etc

16 0.11092689 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

17 0.11029921 1849 andrew gelman stats-2013-05-09-Same old same old

18 0.1087804 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering

19 0.1083933 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

20 0.10750476 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.166), (1, 0.099), (2, 0.05), (3, -0.034), (4, 0.068), (5, 0.016), (6, 0.005), (7, -0.03), (8, 0.091), (9, 0.094), (10, 0.022), (11, 0.038), (12, 0.016), (13, -0.015), (14, 0.008), (15, 0.043), (16, -0.037), (17, -0.009), (18, -0.021), (19, 0.005), (20, 0.009), (21, 0.022), (22, 0.097), (23, -0.015), (24, 0.04), (25, 0.049), (26, 0.072), (27, -0.056), (28, -0.027), (29, 0.033), (30, 0.051), (31, 0.033), (32, 0.017), (33, 0.044), (34, 0.013), (35, -0.035), (36, -0.017), (37, -0.021), (38, -0.032), (39, 0.016), (40, 0.039), (41, -0.056), (42, 0.048), (43, -0.01), (44, 0.05), (45, 0.05), (46, -0.026), (47, -0.004), (48, 0.001), (49, -0.016)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97736412 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?


2 0.82078797 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

Introduction: David Hoaglin writes: After seeing it cited, I just read your paper in Technometrics. The home radon levels provide an interesting and instructive example. I [Hoaglin] have a different take on the difficulty of interpreting the estimated coefficient of the county-level basement proportion (gamma-sub-2) on page 434. An important part of the difficulty involves “other things being equal.” That sounds like the widespread interpretation of a regression coefficient as telling how the dependent variable responds to change in that predictor when the other predictors are held constant. Unfortunately, as a general interpretation, that language is oversimplified; it doesn’t reflect how regression actually works. The appropriate general interpretation is that the coefficient tells how the dependent variable responds to change in that predictor after allowing for simultaneous change in the other predictors in the data at hand. Thus, in the county-level regression gamma-sub-2 summarize
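One facet of this can be shown with a tiny numerical example (my own, not taken from Hoaglin’s letter): with correlated predictors, the coefficient on a predictor depends on which other predictors are in the model, because the fit adjusts for how the predictors move together in the data at hand, which is why the casual “holding the others constant” reading needs care.

```python
# Same predictor, different coefficient, depending on what else is in the
# model: a small illustration with correlated predictors (made-up data).
import numpy as np

rng = np.random.default_rng(3)
n = 10000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)     # x2 co-moves with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X_joint = np.column_stack([np.ones(n), x1, x2])
X_alone = np.column_stack([np.ones(n), x1])
b_joint, *_ = np.linalg.lstsq(X_joint, y, rcond=None)
b_alone, *_ = np.linalg.lstsq(X_alone, y, rcond=None)

print("x1 coefficient, x1 and x2 in model:", round(float(b_joint[1]), 2))  # ~ 2.0
print("x1 coefficient, x1 alone          :", round(float(b_alone[1]), 2))  # ~ 2.0 + 3.0*0.8 = 4.4
```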

3 0.81523556 796 andrew gelman stats-2011-07-10-Matching and regression: two great tastes etc etc

Introduction: Matthew Bogard writes: Regarding the book Mostly Harmless Econometrics, you state: A casual reader of the book might be left with the unfortunate impression that matching is a competitor to regression rather than a tool for making regression more effective. But in fact isn’t that what they are arguing, that, in a ‘mostly harmless way’ regression is in fact a matching estimator itself? “Our view is that regression can be motivated as a particular sort of weighted matching estimator, and therefore the differences between regression and matching estimates are unlikely to be of major empirical importance” (Chapter 3 p. 70) They seem to be distinguishing regression (without prior matching) from all other types of matching techniques, and therefore implying that regression can be a ‘mostly harmless’ substitute or competitor to matching. My previous understanding, before starting this book was as you say, that matching is a tool that makes regression more effective. I have n

4 0.79642946 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

Introduction: Mike Johns writes: Are you familiar with the work of Ai and Norton on interactions in logit/probit models? I’d be curious to hear your thoughts. Ai, C.R. and Norton E.C. 2003. Interaction terms in logit and probit models. Economics Letters 80(1): 123-129. A peer ref just cited this paper in reaction to a logistic model we tested and claimed that the “only” way to test an interaction in logit/probit regression is to use the cross derivative method of Ai & Norton. I’ve never heard of this issue or method. It leaves me wondering what the interaction term actually tests (something Ai & Norton don’t discuss) and why such an important discovery is not more widely known. Is this an issue that is of particular relevance to econometric analysis because they approach interactions from the difference-in-difference perspective? Full disclosure, I’m coming from a social science/epi background. Thus, i’m not interested in the d-in-d estimator; I want to know if any variables modify the rela

5 0.79596573 375 andrew gelman stats-2010-10-28-Matching for preprocessing data for causal inference

Introduction: Chris Blattman writes: Matching is not an identification strategy or a solution to your endogeneity problem; it is a weighting scheme. Saying matching will reduce endogeneity bias is like saying that the best way to get thin is to weigh yourself in kilos. The statement makes no sense. It confuses technique with substance. . . . When you run a regression, you control for the X you can observe. When you match, you are simply matching based on those same X. . . . I see what Chris is getting at–matching, like regression, won’t help for the variables you’re not controlling for–but I disagree with his characterization of matching as a weighting scheme. I see matching as a way to restrict your analysis to comparable cases. The statistical motivation: robustness. If you had a good enough model, you wouldn’t need to match, you’d just fit the model to the data. But in common practice we often use simple regression models and so it can be helpful to do some matching first before regress

6 0.77679014 1849 andrew gelman stats-2013-05-09-Same old same old

7 0.76697701 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression

8 0.76506615 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

9 0.76435858 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

10 0.76251334 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

11 0.76090658 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

12 0.75514406 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

13 0.74856025 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

14 0.73075658 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

15 0.72473449 1196 andrew gelman stats-2012-03-04-Piss-poor monocausal social science

16 0.72355568 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

17 0.7232222 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

18 0.72122252 14 andrew gelman stats-2010-05-01-Imputing count data

19 0.72070897 1663 andrew gelman stats-2013-01-09-The effects of fiscal consolidation

20 0.69397867 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(15, 0.045), (16, 0.076), (18, 0.197), (20, 0.013), (21, 0.016), (24, 0.154), (47, 0.013), (50, 0.012), (51, 0.013), (56, 0.012), (81, 0.013), (84, 0.026), (86, 0.046), (95, 0.013), (99, 0.263)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.95086861 969 andrew gelman stats-2011-10-22-Researching the cost-effectiveness of political lobbying organisations

Introduction: Sally Murray from Giving What We Can writes: We are an organisation that assesses different charitable (/fundable) interventions, to estimate which are the most cost-effective (measured in terms of the improvement of life for people in developing countries gained for every dollar invested). Our research guides and encourages greater donations to the most cost-effective charities we thus identify, and our members have so far pledged a total of $14m to these causes, with many hundreds more relying on our advice in a less formal way. I am specifically researching the cost-effectiveness of political lobbying organisations. We are initially focusing on organisations that lobby for ‘big win’ outcomes such as increased funding of the most cost-effective NTD treatments/ vaccine research, changes to global trade rules (potentially) and more obscure lobbies such as “Keep Antibiotics Working”. We’ve a great deal of respect for your work and the superbly rational way you go about it, and

same-blog 2 0.93573546 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?


3 0.92299747 456 andrew gelman stats-2010-12-07-The red-state, blue-state war is happening in the upper half of the income distribution

Introduction: As we said in Red State, Blue State, it’s not the Prius vs. the pickup truck, it’s the Prius vs. the Hummer. Here’s the graph: Or, as Ross Douthat put it in an op-ed yesterday: This means that a culture war that’s often seen as a clash between liberal elites and a conservative middle America looks more and more like a conflict within the educated class — pitting Wheaton and Baylor against Brown and Bard, Redeemer Presbyterian Church against the 92nd Street Y, C. S. Lewis devotees against the Philip Pullman fan club. Our main motivation for doing this work was to change how the news media think about America’s political divisions, and so it’s good to see our ideas getting mainstreamed and moving toward conventional wisdom. P.S. Here’s the time series of graphs showing how the pattern that we and Douthat noticed, of a battle between coastal states and middle America that is occurring among upper-income Americans, is relatively recent, having arisen in the Clinton ye

4 0.92202306 1183 andrew gelman stats-2012-02-25-Calibration!

Introduction: I went to this place a few months ago after it was reviewed in the Times and I was not impressed at all. Not that I’m any kind of authority on barbecue, this just makes me aware of variation in assessments. Food criticism is like personality profiling in psychometrics: there is no objective truth to measure; any meaningful evaluation is inherently statistical.

5 0.92032897 718 andrew gelman stats-2011-05-18-Should kids be able to bring their own lunches to school?

Introduction: I encountered this news article, “Chicago school bans some lunches brought from home”: At Little Village, most students must take the meals served in the cafeteria or go hungry or both. . . . students are not allowed to pack lunches from home. Unless they have a medical excuse, they must eat the food served in the cafeteria. . . . Such discussions over school lunches and healthy eating echo a larger national debate about the role government should play in individual food choices. “This is such a fundamental infringement on parental responsibility,” said J. Justin Wilson, a senior researcher at the Washington-based Center for Consumer Freedom, which is partially funded by the food industry. . . . For many CPS parents, the idea of forbidding home-packed lunches would be unthinkable. . . . If I had read this two years ago, I’d be at one with J. Justin Wilson and the outraged kids and parents. But last year we spent a sabbatical in Paris, where . . . kids aren’t allowed to bring

6 0.91815197 698 andrew gelman stats-2011-05-05-Shocking but not surprising

7 0.91216296 1292 andrew gelman stats-2012-05-01-Colorless green facts asserted resolutely

8 0.90548795 2046 andrew gelman stats-2013-10-01-I’ll say it again

9 0.90197885 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals

10 0.9003343 1319 andrew gelman stats-2012-05-14-I hate to get all Gerd Gigerenzer on you here, but . . .

11 0.89883435 588 andrew gelman stats-2011-02-24-In case you were wondering, here’s the price of milk

12 0.89604461 1691 andrew gelman stats-2013-01-25-Extreem p-values!

13 0.885234 621 andrew gelman stats-2011-03-20-Maybe a great idea in theory, didn’t work so well in practice

14 0.88172388 1922 andrew gelman stats-2013-07-02-They want me to send them free material and pay for the privilege

15 0.88078833 114 andrew gelman stats-2010-06-28-More on Bayesian deduction-induction

16 0.87576765 1074 andrew gelman stats-2011-12-20-Reading a research paper != agreeing with its claims

17 0.87256497 1204 andrew gelman stats-2012-03-08-The politics of economic and statistical models

18 0.86547971 2239 andrew gelman stats-2014-03-09-Reviewing the peer review process?

19 0.86366814 2181 andrew gelman stats-2014-01-21-The Commissar for Traffic presents the latest Five-Year Plan

20 0.85771191 1382 andrew gelman stats-2012-06-17-How to make a good fig?