andrew_gelman_stats-2010-257 knowledge-graph by maker-knowledge-mining

257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations


meta info for this blog

Source: html

Introduction: Andrew Eppig writes: I’m a physicist by training who is transitioning to the social sciences. I recently came across a reference in the Economist to a paper on IQ and parasites, which I read as I have more than a passing interest in IQ research (having read much that you and others (e.g., Shalizi, Wicherts) have written). In this paper I note that the authors find a very high correlation between national IQ and parasite prevalence. The strength of the correlation (-0.76 to -0.82) surprised me, as I’m used to much weaker correlations in the social sciences. To me, it’s a bit too high, suggesting that there are other factors at play or that one of the variables is merely a proxy for a large number of other variables. But I have no basis for this other than a gut feeling and a memory of a plot on Language Log about the distribution of correlation coefficients in social psychology. So my question is this: Is a correlation in the range of (-0.82, -0.76) more likely to be a correlation between two variables with no deeper relationship, or indicative of a missing set of underlying variables?


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Andrew Eppig writes: I’m a physicist by training who is transitioning to the social sciences. [sent-1, score-0.251]

2 I recently came across a reference in the Economist to a paper on IQ and parasites which I read as I have more than a passing interest in IQ research (having read much that you and others (e.g., Shalizi, Wicherts) have written). [sent-2, score-0.23]

3 In this paper I note that the authors find a very high correlation between national IQ and parasite prevalence. [sent-5, score-0.334]

4 The strength of the correlation (-0.76 to -0.82) surprised me, as I’m used to much weaker correlations in the social sciences. [sent-8, score-0.328]

5 To me, it’s a bit too high, suggesting that there are other factors at play or that one of the variables is merely a proxy for a large number of other variables. [sent-9, score-0.397]

6 But I have no basis for this other than a gut feeling and a memory of a plot on Language Log about the distribution of correlation coefficients in social psychology. [sent-10, score-0.605]

7 So my question is this: Is a correlation in the range of (-0.82, -0.76) more likely to be a correlation between two variables with no deeper relationship or indicative of a missing set of underlying variables? [sent-11, score-0.316; sent-13, score-0.714]

9 My reply: First off, I don’t think you can ever distinguish between correlations of .76 and .82. [sent-14, score-0.226]

10 I don’t think you can treat the high correlations as evidence against their argument. [sent-19, score-0.256]
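The point that correlations of .76 and .82 cannot be distinguished is easy to check with the Fisher z-transform. A minimal sketch (the sample size of roughly 100 countries is our assumption, not a figure from the post):

```python
import math

def fisher_ci(r, n, z=1.96):
    """Approximate 95% interval for a correlation via the Fisher z-transform."""
    zr = math.atanh(r)           # Fisher z of the observed correlation
    se = 1.0 / math.sqrt(n - 3)  # large-sample standard error on the z scale
    return math.tanh(zr - z * se), math.tanh(zr + z * se)

# With n around 100 countries, the intervals for r = -0.76 and r = -0.82
# overlap heavily, so the two values cannot be told apart from the data alone.
print(fisher_ci(-0.76, 100))
print(fisher_ci(-0.82, 100))
```

At this sample size each interval is roughly 0.2 wide on the correlation scale, far wider than the 0.06 gap between the two reported values.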

11 Finally, are you related to the first author of the linked article, or is it just that you did a search on Eppig and encountered this stuff? [sent-21, score-0.305]

12 Eppig responds: I am in fact related to the first author of the study — he’s my brother. [sent-24, score-0.245]

13 Since my first question, I’ve been wondering about how to interpret the results of a regression when some of the independent variables have been imputed via regression. [sent-25, score-0.666]

14 + xn) where x1 has had its missing values imputed using: fit. [sent-30, score-0.402]

15 + xn) Are there extra considerations required in interpreting the model fit. [sent-34, score-0.226]

16 Can one read off the coefficient values and errors from fit. [sent-36, score-0.289]

17 Naively, I feel that the errors in xn are now correlated with the other independent variables and a simple linear regression is no longer appropriate/valid. [sent-40, score-0.826]

18 Are the coefficients of x1, x2,…, xn valid but the errors invalid? [sent-41, score-0.639]

19 In general you want to fit both models together; that is, to model all the variables jointly. [sent-43, score-0.306]

20 That said, in practice I’ll typically just take the imputed x-values as exact and not think too hard about it. [sent-44, score-0.23]
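The two-stage shortcut described here can be sketched in a few lines. This is our illustration, not code from the post: the original `lm` calls and fit-object names are truncated in the extract, so all names below (`ols`, `x1`, `x2`, `y`) are hypothetical, and simple one-predictor fits stand in for the multi-predictor models:

```python
# Minimal sketch of regression imputation followed by an outcome regression.
from statistics import mean

def ols(x, y):
    """Slope and intercept of a simple least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return b, my - b * mx

# x1 has missing entries (None); x2 is fully observed. Toy data.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [2.1, None, 6.0, None, 9.8, 12.1]
y  = [1.0, 2.2, 2.9, 4.1, 5.0, 6.2]

# Step 1: impute x1 from x2 using the complete cases.
obs = [(a, b) for a, b in zip(x2, x1) if b is not None]
slope, icept = ols([a for a, _ in obs], [b for _, b in obs])
x1_filled = [b if b is not None else slope * a + icept
             for a, b in zip(x2, x1)]

# Step 2: regress y on the imputed x1 as if it were observed -- the
# "treat imputed values as exact" shortcut. Its standard errors will
# understate the true uncertainty, because the imputation step's error
# is ignored; this is exactly the concern raised in the question.
b1, b0 = ols(x1_filled, y)
print(b1, b0)
```

The coefficients from step 2 are the "naive" estimates; a joint model (or multiple imputation) would propagate the imputation uncertainty into the standard errors.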


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('xn', 0.411), ('eppig', 0.338), ('correlation', 0.243), ('iq', 0.234), ('imputed', 0.23), ('variables', 0.225), ('wicherts', 0.174), ('lm', 0.169), ('correlations', 0.165), ('errors', 0.124), ('coefficients', 0.104), ('gut', 0.103), ('indicative', 0.097), ('invalid', 0.097), ('transitioning', 0.097), ('factors', 0.093), ('high', 0.091), ('social', 0.088), ('author', 0.088), ('missing', 0.087), ('values', 0.085), ('simultaneous', 0.085), ('model', 0.081), ('first', 0.08), ('read', 0.08), ('proxy', 0.079), ('knowledgeable', 0.079), ('related', 0.077), ('considerations', 0.077), ('naively', 0.076), ('weaker', 0.075), ('question', 0.073), ('equations', 0.07), ('passing', 0.07), ('responds', 0.069), ('interpreting', 0.068), ('strength', 0.067), ('memory', 0.067), ('regression', 0.066), ('physicist', 0.066), ('dependent', 0.065), ('imputation', 0.064), ('shalizi', 0.063), ('deeper', 0.062), ('hearing', 0.062), ('assess', 0.062), ('distinguish', 0.061), ('log', 0.06), ('encountered', 0.06), ('hypotheses', 0.059)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations


2 0.21391922 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

Introduction: Majid Ezzati writes: My research group is increasingly focusing on a series of problems that involve data that either have missingness or measurements that may have bias/error. We have at times developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model) and at other times, other groups impute the data. The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to suggest good references on whether having imputed X and/or Y in a subsequent regression is correct or if it could somehow lead to biased/spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not): 1) X and Y both measured (perhaps with error) 2) Y imputed using some data and a model and X measur

3 0.19675247 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations

Introduction: Vincent Yip writes: I have read your paper [with Kobi Abayomi and Marc Levy] regarding multiple imputation application. In order to diagnostic my imputed data, I used Kolmogorov-Smirnov (K-S) tests to compare the distribution differences between the imputed and observed values of a single attribute as mentioned in your paper. My question is: For example I have this attribute X with the following data: (NA = missing) Original dataset: 1, NA, 3, 4, 1, 5, NA Imputed dataset: 1, 2 , 3, 4, 1, 5, 6 a) in order to run the KS test, will I treat the observed data as 1, 3, 4,1, 5? b) and for the observed data, will I treat 1, 2 , 3, 4, 1, 5, 6 as the imputed dataset for the K-S test? or just 2 ,6? c) if I used m=5, I will have 5 set of imputed data sets. How would I apply K-S test to 5 of them and compare to the single observed distribution? Do I combine the 5 imputed data set into one by averaging each imputed values so I get one single imputed data and compare with the ob
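The comparison asked about in (a) and (b) above, observed values versus the filled-in values only, uses the two-sample Kolmogorov-Smirnov statistic. A dependency-free sketch with the toy data from the excerpt (reading "just 2, 6" as the imputed-only sample; the helper name `ks_stat` is ours):

```python
def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup |ECDF_a(x) - ECDF_b(x)|."""
    pts = sorted(set(a) | set(b))
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in pts)

observed = [1, 3, 4, 1, 5]  # the non-missing values of the attribute
imputed  = [2, 6]           # only the values that were filled in
print(ks_stat(observed, imputed))
```

With m = 5 imputed datasets, one would compute this statistic once per dataset against the same observed sample, rather than averaging the imputations into a single dataset first.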

4 0.17459713 561 andrew gelman stats-2011-02-06-Poverty, educational performance – and can be done about it

Introduction: Andrew has pointed to Jonathan Livengood’s analysis of the correlation between poverty and PISA results, whereby schools with poorer students get poorer test results. I’d have written a comment, but then I couldn’t have inserted a chart. Andrew points out that a causal analysis is needed. This reminds me of an intervention that has been done before: take a child out of poverty, and bring him up in a better-off family. What’s going to happen? There have been several studies examining correlations between adoptive and biological parents’ IQ (assuming IQ is a test analogous to the math and verbal tests, and that parent IQ is analogous to the quality of instruction – but the point is in the analysis not in the metric). This is the result (from Adoption Strategies by Robin P Corley in Encyclopedia of Life Sciences): So, while it did make a difference at an early age, with increasing age of the adopted child, the intelligence of adoptive parents might not be making any difference

5 0.15990552 1910 andrew gelman stats-2013-06-22-Struggles over the criticism of the “cannabis users and IQ change” paper

Introduction: Ole Rogeberg points me to a discussion of a discussion of a paper: Did pre-release of my [Rogeberg's] PNAS paper on methodological problems with Meier et al’s 2012 paper on cannabis and IQ reduce the chances that it will have its intended effect? In my case, serious methodological issues related to causal inference from non-random observational data became framed as a conflict over conclusions, forcing the original research team to respond rapidly and insufficiently to my concerns, and prompting them to defend their conclusions and original paper in a way that makes a later, more comprehensive reanalysis of their data less likely. This fits with a recurring theme on this blog: the defensiveness of researchers who don’t want to admit they were wrong. Setting aside cases of outright fraud and plagiarism, I think the worst case remains that of psychologists Neil Anderson and Deniz Ones, who denied any problems even in the presence of a smoking gun of a graph revealing their data

6 0.13386521 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

7 0.13004081 1474 andrew gelman stats-2012-08-29-More on scaled-inverse Wishart and prior independence

8 0.1290292 1966 andrew gelman stats-2013-08-03-Uncertainty in parameter estimates using multilevel models

9 0.12814157 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

10 0.12757735 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

11 0.12417243 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

12 0.12251225 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

13 0.11935133 2315 andrew gelman stats-2014-05-02-Discovering general multidimensional associations

14 0.1164207 2364 andrew gelman stats-2014-06-08-Regression and causality and variable ordering

15 0.11396313 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

16 0.1124558 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

17 0.11164731 1486 andrew gelman stats-2012-09-07-Prior distributions for regression coefficients

18 0.10957573 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

19 0.10941154 2258 andrew gelman stats-2014-03-21-Random matrices in the news

20 0.10679671 1620 andrew gelman stats-2012-12-12-“Teaching effectiveness” as another dimension in cognitive ability


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.176), (1, 0.063), (2, 0.057), (3, -0.049), (4, 0.054), (5, 0.022), (6, 0.02), (7, -0.049), (8, 0.076), (9, 0.096), (10, 0.023), (11, 0.04), (12, 0.003), (13, -0.012), (14, 0.017), (15, 0.034), (16, 0.03), (17, 0.015), (18, 0.005), (19, -0.016), (20, 0.01), (21, 0.022), (22, 0.023), (23, -0.059), (24, 0.038), (25, 0.011), (26, 0.034), (27, -0.043), (28, 0.011), (29, -0.014), (30, 0.081), (31, 0.044), (32, 0.063), (33, 0.046), (34, 0.002), (35, -0.023), (36, 0.084), (37, 0.038), (38, 0.023), (39, -0.042), (40, -0.015), (41, -0.05), (42, 0.072), (43, 0.011), (44, 0.004), (45, -0.015), (46, 0.01), (47, -0.025), (48, 0.061), (49, 0.006)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96342027 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations


2 0.79878986 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

Introduction: Mike Johns writes: Are you familiar with the work of Ai and Norton on interactions in logit/probit models? I’d be curious to hear your thoughts. Ai, C.R. and Norton E.C. 2003. Interaction terms in logit and probit models. Economics Letters 80(1): 123-129. A peer ref just cited this paper in reaction to a logistic model we tested and claimed that the “only” way to test an interaction in logit/probit regression is to use the cross derivative method of Ai & Norton. I’ve never heard of this issue or method. It leaves me wondering what the interaction term actually tests (something Ai & Norton don’t discuss) and why such an important discovery is not more widely known. Is this an issue that is of particular relevance to econometric analysis because they approach interactions from the difference-in-difference perspective? Full disclosure, I’m coming from a social science/epi background. Thus, i’m not interested in the d-in-d estimator; I want to know if any variables modify the rela

3 0.77088684 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

Introduction: Fred Schiff writes: I’m writing to you to ask about the “R-squared” approximation procedure you suggest in your 2004 book with Dr. Hill. [See also this paper with Pardoe---ed.] I’m a media sociologist at the University of Houston. I’ve been using HLM3 for about two years. Briefly about my data. It’s a content analysis of news stories with a continuous scale dependent variable, story prominence. I have 6090 news stories, 114 newspapers, and 59 newspaper group owners. All the Level-1, Level-2 and dependent variables have been standardized. Since the means were zero anyway, we left the variables uncentered. All the Level-3 ownership groups and characteristics are dichotomous scales that were left uncentered. PROBLEM: The single most important result I am looking for is to compare the strength of nine competing Level-1 variables in their ability to predict and explain the outcome variable, story prominence. We are trying to use the residuals to calculate a “R-squ

4 0.76677066 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

Introduction: Andy Flies, Ph.D. candidate in zoology, writes: After reading your paper about scaling regression inputs by two standard deviations I found your blog post stating that you wished you had scaled by 1 sd and coded the binary inputs as -1 and 1. Here is my question: If you code the binary input as -1 and 1, do you then standardize it? This makes sense to me because the mean of the standardized input is then zero and the sd is 1, which is what the mean and sd are for all of the other standardized inputs. I know that if you code the binary input as 0 and 1 it should not be standardized. Also, I am not interested in the actual units (i.e. mg/ml) of my response variable and I would like to compare a couple of different response variables that are on different scales. Would it make sense to standardize the response variable also? My reply: No, I don’t standardize the binary input. The point of standardizing inputs is to make the coefs directly interpretable, but with binary i

5 0.7460857 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

Introduction: Andy Cooper writes: A link to an article , “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long-since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book wit

6 0.74331623 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

7 0.74049312 1870 andrew gelman stats-2013-05-26-How to understand coefficients that reverse sign when you start controlling for things?

8 0.7379697 14 andrew gelman stats-2010-05-01-Imputing count data

9 0.73736989 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

10 0.73721886 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

11 0.73477644 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

12 0.73377585 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

13 0.72724533 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

14 0.72623038 301 andrew gelman stats-2010-09-28-Correlation, prediction, variation, etc.

15 0.72491515 251 andrew gelman stats-2010-09-02-Interactions of predictors in a causal model

16 0.71499795 1663 andrew gelman stats-2013-01-09-The effects of fiscal consolidation

17 0.7013765 1918 andrew gelman stats-2013-06-29-Going negative

18 0.69888896 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

19 0.69828218 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

20 0.69662052 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.011), (16, 0.105), (21, 0.029), (24, 0.137), (56, 0.018), (58, 0.018), (61, 0.024), (76, 0.166), (86, 0.056), (99, 0.308)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96741438 988 andrew gelman stats-2011-11-02-Roads, traffic, and the importance in decision analysis of carefully examining your goals

Introduction: Sandeep Baliga writes : [In a recent study , Gilles Duranton and Matthew Turner write:] For interstate highways in metropolitan areas we [Duranton and Turner] find that VKT (vehicle kilometers traveled) increases one for one with interstate highways, confirming the fundamental law of highway congestion.’ Provision of public transit also simply leads to the people taking public transport being replaced by drivers on the road. Therefore: These findings suggest that both road capacity expansions and extensions to public transit are not appropriate policies with which to combat traffic congestion. This leaves congestion pricing as the main candidate tool to curb traffic congestion. To which I reply: Sure, if your goal is to curb traffic congestion . But what sort of goal is that? Thinking like a microeconomist, my policy goal is to increase people’s utility. Sure, traffic congestion is annoying, but there must be some advantages to driving on that crowded road or pe

2 0.96143454 300 andrew gelman stats-2010-09-28-A calibrated Cook gives Dems the edge in Nov, sez Sandy

Introduction: Sandy Gordon sends along this fun little paper forecasting the 2010 midterm election using expert predictions (the Cook and Rothenberg Political Reports). Gordon’s gimmick is that he uses past performance to calibrate the reports’ judgments based on “solid,” “likely,” “leaning,” and “toss-up” categories, and then he uses the calibrated versions of the current predictions to make his forecast. As I wrote a few weeks ago in response to Nate’s forecasts, I think the right way to go, if you really want to forecast the election outcome, is to use national information to predict the national swing and then do regional, state, and district-level adjustments using whatever local information is available. I don’t see the point of using only the expert forecasts and no other data. Still, Gordon is bringing new information (his calibrations) to the table, so I wanted to share it with you. Ultimately I like the throw-in-everything approach that Nate uses (although I think Nate’s descr

3 0.9575752 1351 andrew gelman stats-2012-05-29-A Ph.D. thesis is not really a marathon

Introduction: Thomas Basbøll writes : A blog called The Thesis Whisperer was recently pointed out to me. I [Basbøll] haven’t looked at it closely, but I’ll be reading it regularly for a while before I recommend it. I’m sure it’s a good place to go to discover that you’re not alone, especially when you’re struggling with your dissertation. One post caught my eye immediately. It suggested that writing a thesis is not a sprint, it’s a marathon. As a metaphorical adjustment to a particular attitude about writing, it’s probably going to help some people. But if we think it through, it’s not really a very good analogy. No one is really a “sprinter”; and writing a dissertation is nothing like running a marathon. . . . Here’s Ben’s explication of the analogy at the Thesis Whisperer, which seems initially plausible. …writing a dissertation is a lot like running a marathon. They are both endurance events, they last a long time and they require a consistent and carefully calculated amount of effor

4 0.95605201 1551 andrew gelman stats-2012-10-28-A convenience sample and selected treatments

Introduction: Charlie Saunders writes: A study has recently been published in the New England Journal of Medicine (NEJM) which uses survival analysis to examine long-acting reversible contraception (e.g. intrauterine devices [IUDs]) vs. short-term commonly prescribed methods of contraception (e.g. oral contraceptive pills) on unintended pregnancies. The authors use a convenience sample of over 7,000 women. I am not well versed-enough in sampling theory to determine the appropriateness of this but it would seem that the use of a non-probability sampling would be a significant drawback. If you could give me your opinion on this, I would appreciate it. The NEJM is one of the top medical journals in the country. Could this type of sampling method coupled with this method of analysis be published in a journal like JASA? My reply: There are two concerns, first that it is a convenience sample and thus not representative of the population, and second that the treatments are chosen rather tha

5 0.95443177 1835 andrew gelman stats-2013-05-02-7 ways to separate errors from statistics

Introduction: Betsey Stevenson and Justin Wolfers have been inspired by the recent Reinhardt and Rogoff debacle to list “six ways to separate lies from statistics” in economics research: 1. “Focus on how robust a finding is, meaning that different ways of looking at the evidence point to the same conclusion.” 2. Don’t confuse statistical with practical significance. 3. “Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists.” 4. “Don’t fall into the trap of thinking about an empirical finding as ‘right’ or ‘wrong.’ At best, data provide an imperfect guide.” 5. “Don’t mistake correlation for causation.” 6. “Always ask ‘so what?’” I like all these points, especially #4, which I think doesn’t get said enough. As I wrote a few months ago, high-profile social science research aims for proof, not for understanding—and that’s a problem. My addition to the list If you compare my title above to that of Stevenson

6 0.95388591 1850 andrew gelman stats-2013-05-10-The recursion of pop-econ

7 0.95163369 283 andrew gelman stats-2010-09-17-Vote Buying: Evidence from a List Experiment in Lebanon

8 0.95140517 1609 andrew gelman stats-2012-12-06-Stephen Kosslyn’s principles of graphics and one more: There’s no need to cram everything into a single plot

same-blog 9 0.95024097 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

10 0.94092762 337 andrew gelman stats-2010-10-12-Election symposium at Columbia Journalism School

11 0.93940932 32 andrew gelman stats-2010-05-14-Causal inference in economics

12 0.93703759 1818 andrew gelman stats-2013-04-22-Goal: Rules for Turing chess

13 0.92989516 2246 andrew gelman stats-2014-03-13-An Economist’s Guide to Visualizing Data

14 0.92915541 1600 andrew gelman stats-2012-12-01-$241,364.83 – $13,000 = $228,364.83

15 0.92807698 368 andrew gelman stats-2010-10-25-Is instrumental variables analysis particularly susceptible to Type M errors?

16 0.92526937 922 andrew gelman stats-2011-09-24-Economists don’t think like accountants—but maybe they should

17 0.92502314 51 andrew gelman stats-2010-05-26-If statistics is so significantly great, why don’t statisticians use statistics?

18 0.92499363 608 andrew gelman stats-2011-03-12-Single or multiple imputation?

19 0.9216401 1105 andrew gelman stats-2012-01-08-Econ debate about prices at a fancy restaurant

20 0.91050482 2013 andrew gelman stats-2013-09-08-What we need here is some peer review for statistical graphics