andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-451 knowledge-graph by maker-knowledge-mining

451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?


meta info for this blog

Source: html

Introduction: Fabio Rojas writes: In much of the social sciences outside economics, it’s very common for people to take a regression course or two in graduate school and then stop their statistical education. This creates a situation where you have a large pool of people who have some knowledge, but not a lot of knowledge. As a result, you have a pretty big gap between people like yourself, who are heavily invested in the cutting edge of applied statistics, and other folks. So here is the question: What are the major lessons about good statistical practice that “rank and file” social scientists should know? Sure, most people can recite “Correlation is not causation” or “statistical significance is not substantive significance.” But what are the other big lessons? This question comes from my own experience. I have a math degree and took regression analysis in graduate school, but I definitely do not have the level of knowledge of a statistician. I also do mixed method research, and field wor


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Fabio Rojas writes: In much of the social sciences outside economics, it’s very common for people to take a regression course or two in graduate school and then stop their statistical education. [sent-1, score-0.817]

2 This creates a situation where you have a large pool of people who have some knowledge, but not a lot of knowledge. [sent-2, score-0.195]

3 As a result, you have a pretty big gap between people like yourself, who are heavily invested in the cutting edge of applied statistics, and other folks. [sent-3, score-0.51]

4 So here is the question: What are the major lessons about good statistical practice that “rank and file” social scientists should know? [sent-4, score-0.386]

5 Sure, most people can recite “Correlation is not causation” or “statistical significance is not substantive significance.” [sent-5, score-0.081]

6 I have a math degree and took regression analysis in graduate school, but I definitely do not have the level of knowledge of a statistician. [sent-8, score-0.63]

7 I also do mixed method research, and field work is very time intensive. [sent-9, score-0.082]

8 I often feel like that I face a tough choice – I can delve into more advanced statistics, but that often requires a huge investment on my part. [sent-10, score-0.403]

9 Is there a middle ground between the naive user of regression analysis and what you do? [sent-11, score-0.54]

10 My reply: You can take a look at my book with Jennifer Hill. [sent-12, score-0.111]

11 Chapters 3-5 hit the basics, then you can jump to chapters 9-10 for causal inference. [sent-13, score-0.251]

12 - Don’t just analyze your variables straight out of the box. [sent-15, score-0.37]

13 You can break continuous variables into categories (for example, instead of age and age-squared, you can use indicators for 19-29, 30-44, 45-64, 65+), and, from the other direction, you can average several related variables to create a combined score. [sent-16, score-0.783]

14 - You can typically treat a discrete outcome (for example, responses on a 1-5 scale) as numeric. [sent-17, score-0.16]

15 Don’t worry about ordered logit/probit/etc., just run your regression already. [sent-18, score-0.379]

16 - Take the two most important input variables in your regression and throw in their interaction. [sent-19, score-0.578]

17 - The key assumptions of a regression model are validity and additivity. [sent-20, score-0.36]

18 Except when you’re focused on predictions, don’t spend one minute worrying about distributional issues such as normality or equal variance of the errors. [sent-21, score-0.447]

19 Possibly the readers of this blog could offer some suggested tips of their own? [sent-22, score-0.238]
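The variable-construction tips in sentences 12-16 above can be sketched in a few lines. This is only an illustration under assumptions: the age cut points come from the post, but the function names, the example respondent, and the variable names (`trust_items`, `income`, `education`) are hypothetical and not from the original.

```python
from statistics import mean

# Age bins quoted in the post: 19-29, 30-44, 45-64, 65+ (None = open-ended).
AGE_BINS = [(19, 29), (30, 44), (45, 64), (65, None)]


def age_indicators(age):
    """Return 0/1 indicators for the age bins, instead of age and age-squared."""
    indicators = {}
    for low, high in AGE_BINS:
        label = f"age_{low}_{high}" if high is not None else f"age_{low}_plus"
        in_bin = age >= low and (high is None or age <= high)
        indicators[label] = int(in_bin)
    return indicators


def combined_score(items):
    """Average several related items (e.g. 1-5 survey responses) into one score."""
    return mean(items)


def add_interaction(row, a, b):
    """Return a copy of row with the product of inputs a and b added."""
    new_row = dict(row)
    new_row[f"{a}_x_{b}"] = row[a] * row[b]
    return new_row


# Hypothetical respondent, for illustration only.
respondent = {"age": 37, "trust_items": [4, 5, 3], "income": 2.5, "education": 16}

features = age_indicators(respondent["age"])
features["trust_score"] = combined_score(respondent["trust_items"])
features.update(
    add_interaction(
        {"income": respondent["income"], "education": respondent["education"]},
        "income",
        "education",
    )
)
print(features)
```

When these columns go into a regression with an intercept, one of the age indicators would normally be left out as the baseline category. The combined score and any 1-5 outcome are simply treated as numeric, in the spirit of sentence 14.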


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('regression', 0.278), ('tips', 0.238), ('lessons', 0.22), ('variables', 0.211), ('chapters', 0.167), ('graduate', 0.151), ('delve', 0.13), ('basics', 0.13), ('normality', 0.126), ('knowledge', 0.121), ('invested', 0.119), ('distributional', 0.114), ('school', 0.111), ('take', 0.111), ('fabio', 0.108), ('rojas', 0.108), ('indicators', 0.106), ('worrying', 0.106), ('causation', 0.106), ('investment', 0.105), ('rank', 0.103), ('edge', 0.103), ('ordered', 0.101), ('minute', 0.101), ('cutting', 0.101), ('creates', 0.101), ('heavily', 0.095), ('pool', 0.094), ('gap', 0.092), ('ground', 0.092), ('combined', 0.091), ('input', 0.089), ('naive', 0.088), ('file', 0.087), ('advanced', 0.086), ('jump', 0.084), ('statistical', 0.083), ('categories', 0.083), ('social', 0.083), ('user', 0.082), ('tough', 0.082), ('validity', 0.082), ('mixed', 0.082), ('substantive', 0.081), ('break', 0.081), ('definitely', 0.08), ('discrete', 0.08), ('treat', 0.08), ('analyze', 0.08), ('straight', 0.079)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

2 0.2127448 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

Introduction: Andy Cooper writes: A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long-since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage in a influence is important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book wit

3 0.13572283 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

Introduction: A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold. Here’s my reply: As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components. The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes. I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an inform

4 0.13074116 796 andrew gelman stats-2011-07-10-Matching and regression: two great tastes etc etc

Introduction: Matthew Bogard writes: Regarding the book Mostly Harmless Econometrics, you state: A casual reader of the book might be left with the unfortunate impression that matching is a competitor to regression rather than a tool for making regression more effective. But in fact isn’t that what they are arguing, that, in a ‘mostly harmless way’ regression is in fact a matching estimator itself? “Our view is that regression can be motivated as a particular sort of weighted matching estimator, and therefore the differences between regression and matching estimates are unlikely to be of major empirical importance” (Chapter 3 p. 70) They seem to be distinguishing regression (without prior matching) from all other types of matching techniques, and therefore implying that regression can be a ‘mostly harmless’ substitute or competitor to matching. My previous understanding, before starting this book was as you say, that matching is a tool that makes regression more effective. I have n

5 0.12688415 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

6 0.12417243 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

7 0.12143586 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

8 0.1196084 1282 andrew gelman stats-2012-04-26-Bad news about (some) statisticians

9 0.11869285 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

10 0.11798137 1971 andrew gelman stats-2013-08-07-I doubt they cheated

11 0.11237784 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

12 0.11068225 146 andrew gelman stats-2010-07-14-The statistics and the science

13 0.11049561 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

14 0.10955173 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions

15 0.10944034 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

16 0.10891467 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

17 0.10706724 375 andrew gelman stats-2010-10-28-Matching for preprocessing data for causal inference

18 0.10576832 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

19 0.10424629 1228 andrew gelman stats-2012-03-25-Continuous variables in Bayesian networks

20 0.10176391 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression
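The page does not say how the word weights and `simValue` scores above were produced. A minimal sketch of one common approach, assuming plain whitespace tokenization, a `log(N/df)` inverse document frequency, and cosine similarity between unit-normalized vectors, might look like the following. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build unit-length tf-idf vectors (dicts of term -> weight) for a list of texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize so that the dot product of two vectors is their cosine.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy corpus, for illustration only.
docs = [
    "regression variables interaction regression",
    "regression assumptions normality variables",
    "modern art paint chimp",
]
vectors = tfidf_vectors(docs)
for i, vec in enumerate(vectors):
    print(i, round(cosine(vectors[0], vec), 3))
```

Under this scheme a document's similarity to itself is 1 up to floating-point rounding, which would be consistent with the `same-blog` row, though that is a guess about the pipeline rather than something the page confirms.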


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.218), (1, 0.049), (2, 0.029), (3, -0.045), (4, 0.075), (5, 0.075), (6, -0.038), (7, 0.02), (8, 0.055), (9, 0.123), (10, 0.013), (11, 0.013), (12, 0.042), (13, -0.02), (14, 0.019), (15, 0.01), (16, -0.031), (17, 0.016), (18, -0.011), (19, -0.001), (20, 0.02), (21, 0.027), (22, 0.035), (23, 0.022), (24, 0.047), (25, 0.059), (26, 0.097), (27, -0.115), (28, -0.078), (29, -0.018), (30, 0.105), (31, 0.071), (32, 0.034), (33, 0.023), (34, 0.011), (35, 0.0), (36, -0.022), (37, 0.064), (38, -0.065), (39, 0.025), (40, 0.012), (41, -0.012), (42, 0.005), (43, 0.035), (44, 0.066), (45, 0.008), (46, -0.034), (47, -0.021), (48, 0.021), (49, -0.005)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98125362 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

2 0.84198505 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression

Introduction: Haynes Goddard writes: I have been slowly working my way through the grad program in stats here, and the latest course was a biostats course on categorical and survival analysis. I noticed in the semi-parametric and parametric material (Wang and Lee is the text) that they use stepwise regression a lot. I learned in econometrics that stepwise is poor practice, as it defaults to the “theory of the regression line”, that is no theory at all, just the variation in the data. I don’t find the topic on your blog, and wonder if you have addressed the issue. My reply: Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once. To address the issue more directly: the motivation behind stepwise regression is that you have a lot of potential predictors but not e

3 0.82428139 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

4 0.81376958 796 andrew gelman stats-2011-07-10-Matching and regression: two great tastes etc etc

5 0.80622882 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

Introduction: David Hoaglin writes: After seeing it cited, I just read your paper in Technometrics. The home radon levels provide an interesting and instructive example. I [Hoaglin] have a different take on the difficulty of interpreting the estimated coefficient of the county-level basement proportion (gamma-sub-2) on page 434. An important part of the difficulty involves “other things being equal.” That sounds like the widespread interpretation of a regression coefficient as telling how the dependent variable responds to change in that predictor when the other predictors are held constant. Unfortunately, as a general interpretation, that language is oversimplified; it doesn’t reflect how regression actually works. The appropriate general interpretation is that the coefficient tells how the dependent variable responds to change in that predictor after allowing for simultaneous change in the other predictors in the data at hand. Thus, in the county-level regression gamma-sub-2 summarize

6 0.79292828 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

7 0.79094434 375 andrew gelman stats-2010-10-28-Matching for preprocessing data for causal inference

8 0.77780211 1870 andrew gelman stats-2013-05-26-How to understand coefficients that reverse sign when you start controlling for things?

9 0.76168996 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

10 0.73370516 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

11 0.73360252 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

12 0.73278779 1663 andrew gelman stats-2013-01-09-The effects of fiscal consolidation

13 0.72725081 144 andrew gelman stats-2010-07-13-Hey! Here’s a referee report for you!

14 0.72120601 14 andrew gelman stats-2010-05-01-Imputing count data

15 0.71998888 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

16 0.71939087 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters

17 0.70442808 1849 andrew gelman stats-2013-05-09-Same old same old

18 0.70302379 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

19 0.70174468 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

20 0.69806248 1971 andrew gelman stats-2013-08-07-I doubt they cheated


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(12, 0.028), (15, 0.038), (16, 0.078), (18, 0.027), (21, 0.015), (24, 0.138), (25, 0.085), (42, 0.016), (53, 0.031), (55, 0.035), (86, 0.035), (89, 0.057), (97, 0.01), (99, 0.301)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9684605 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

2 0.96069133 1296 andrew gelman stats-2012-05-03-Google Translate for code, and an R help-list bot

Introduction: What we did in our Stan meeting yesterday: Some discussion of revision of the Nuts paper, some conversations about parameterizations of categorical-data models, plans for the R interface, blah blah blah. But also, I had two exciting new ideas! Google Translate for code Wouldn’t it be great if Google Translate could work on computer languages? I suggested this and somebody said that it might be a problem because code isn’t always translatable. But that doesn’t worry me so much. Google Translate for human languages isn’t perfect either but it’s a useful guide. If I want to write a message to someone in French or Spanish or Dutch, I wouldn’t just write it in English and run it through Translate. What I do is try my best to write it in the desired language, but I can try out some tricky words or phrases in the translator. Or, if I start by translating, I go back and forth to make sure it all makes sense. An R help-list bot We were talking about how to build a Stan commun

3 0.957187 1151 andrew gelman stats-2012-02-03-Philosophy of Bayesian statistics: my reactions to Senn

Introduction: Continuing with my discussion of the articles in the special issue of the journal Rationality, Markets and Morals on the philosophy of Bayesian statistics: Stephen Senn, “You May Believe You Are a Bayesian But You Are Probably Wrong”: I agree with Senn’s comments on the impossibility of the de Finetti subjective Bayesian approach. As I wrote in 2008, if you could really construct a subjective prior you believe in, why not just look at the data and write down your subjective posterior. The immense practical difficulties with any serious system of inference render it absurd to think that it would be possible to just write down a probability distribution to represent uncertainty. I wish, however, that Senn would recognize my Bayesian approach (which is also that of John Carlin, Hal Stern, Don Rubin, and, I believe, others). De Finetti is no longer around, but we are! I have to admit that my own Bayesian views and practices have changed. In particular, I resonate wit

4 0.95675534 171 andrew gelman stats-2010-07-30-Silly baseball example illustrates a couple of key ideas they don’t usually teach you in statistics class

Introduction: From a commenter on the web, 21 May 2010: Tampa Bay: Playing .732 ball in the toughest division in baseball, wiped their feet on NY twice. If they sweep Houston, which seems pretty likely, they will be at .750, which I [the commenter] have never heard of. At the time of that posting, the Rays were 30-11. Quick calculation: if a team is good enough to be expected to win 100 games, that is, Pr(win) = 100/162 = .617, then there’s a 5% chance that they’ll have won at least 30 of their first 41 games. That’s a calculation based on simple probability theory of independent events, which isn’t quite right here but will get you close and is a good way to train one’s intuition, I think. Having a .732 record after 41 games is not unheard-of. The Detroit Tigers won 35 of their first 40 games in 1984: that’s .875. (I happen to remember that fast start, having been an Orioles fan at the time.) Now on to the key ideas The passage quoted above illustrates three statistical fa

5 0.95231819 1390 andrew gelman stats-2012-06-23-Traditionalist claims that modern art could just as well be replaced by a “paint-throwing chimp”

Introduction: Jed Dougherty points me to this opinion piece by Jacqueline Stevens, a professor of art at Northwestern University, who writes: Artists are defensive these days because in May the House passed an amendment to a bill eliminating the National Endowment for the Arts. Colleagues, especially those who have received N.E.A. grants, will loathe me for saying this, but just this once I’m sympathetic with the anti-intellectual Republicans behind this amendment. Why? The bill incited a national conversation about a subject that has troubled me for decades: the government — disproportionately — supports art that I do not like. Actually, just about nobody likes modern art. All those soup cans—what’s that all about? The stuff they have in museums nowadays, my 4-year-old could do better than that. Two-thirds of so-called modern artists are drunk and two-thirds are frauds. And, no, I didn’t get my math wrong—there’s just a lot of overlap among these categories! It’s an open secret in my

6 0.95002776 1596 andrew gelman stats-2012-11-29-More consulting experiences, this time in computational linguistics

7 0.94636858 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value

8 0.94521958 1939 andrew gelman stats-2013-07-15-Forward causal reasoning statements are about estimation; reverse causal questions are about model checking and hypothesis generation

9 0.9444207 2254 andrew gelman stats-2014-03-18-Those wacky anti-Bayesians used to be intimidating, but now they’re just pathetic

10 0.94392192 821 andrew gelman stats-2011-07-25-See me talk in the Upper West Side (without graphs) today

11 0.94337618 2089 andrew gelman stats-2013-11-04-Shlemiel the Software Developer and Unknown Unknowns

12 0.94331419 231 andrew gelman stats-2010-08-24-Yet another Bayesian job opportunity

13 0.94312525 1039 andrew gelman stats-2011-12-02-I just flew in from the econ seminar, and boy are my arms tired

14 0.94306922 2179 andrew gelman stats-2014-01-20-The AAA Tranche of Subprime Science

15 0.94287443 2297 andrew gelman stats-2014-04-20-Fooled by randomness

16 0.942141 1763 andrew gelman stats-2013-03-14-Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

17 0.94195867 902 andrew gelman stats-2011-09-12-The importance of style in academic writing

18 0.94189829 2210 andrew gelman stats-2014-02-13-Stopping rules and Bayesian analysis

19 0.94178611 1117 andrew gelman stats-2012-01-13-What are the important issues in ethics and statistics? I’m looking for your input!

20 0.94164497 1878 andrew gelman stats-2013-05-31-How to fix the tabloids? Toward replicable social science research
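The quick baseball calculation quoted in the entry for post 171 above (a team with Pr(win) = 100/162 winning at least 30 of its first 41 games) can be checked with an exact binomial tail. This is a sketch under the same independence assumption the excerpt itself calls "not quite right"; the helper name `binom_tail` is hypothetical.

```python
from math import comb


def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


p_win = 100 / 162  # team expected to win 100 of 162 games

at_least_30 = binom_tail(41, 30, p_win)   # 30 or more wins in 41 games
more_than_30 = binom_tail(41, 31, p_win)  # strictly more than 30 wins

print(f"P(at least 30 of 41)  = {at_least_30:.3f}")
print(f"P(more than 30 of 41) = {more_than_30:.3f}")
```

The two tails differ noticeably, so whether 30 itself is counted matters when comparing the output with the rounded 5% figure in the quoted post.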