R-squared for multilevel models
Fred Schiff writes:

I’m writing to ask about the “R-squared” approximation procedure you suggest in your 2004 book with Dr. Hill. [See also this paper with Pardoe---ed.] I’m a media sociologist at the University of Houston, and I’ve been using HLM3 for about two years.

Briefly, about my data: it’s a content analysis of news stories with a continuous-scale dependent variable, story prominence. I have 6090 news stories, 114 newspapers, and 59 newspaper group owners. All the Level-1, Level-2, and dependent variables have been standardized. Since the means were zero anyway, we left the variables uncentered. All the Level-3 ownership groups and characteristics are dichotomous scales that were left uncentered.

PROBLEM: The single most important result I am looking for is a comparison of the strength of nine competing Level-1 variables in their ability to predict and explain the outcome variable, story prominence. We are trying to use the residuals to calculate an “R-squared” measure for each level, as you and Hill proposed. We haven’t been able to generate OLS regression equations for each newspaper and ownership group in HLM because the manual suggests “optional settings” that are not available in our software (HLM 6).

QUESTION-1 – How could we generate the estimated Bayesian residuals for Level-1?

QUESTION-2 – Is it legitimate to run a model where Level-1 and Level-2 variables are standardized and Level-3 variables are dichotomous dummy variables?

QUESTION-3 – Is it legitimate to run models that estimate parameters for each ownership group and at the same time include the corresponding dummy variables as part of the data structure?

QUESTION-4 – In equations that include Level-3 variables, is it valid to describe the results as applying selectively to the stories (L1) in newspapers (L2) owned by one ownership group (L3, coded 1), as opposed to stories in newspapers of other ownership groups (L3, coded 0)?
My reply: I don’t know the HLM software, so I don’t know how to use it to compute the Bayesian residuals. But you might be happy to hear that we are currently working on implementing these ideas using the lmer/glmer software in R. Once it’s been programmed in one package, it shouldn’t be hard for people to translate it into another.
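To give a feel for the idea, here is a minimal sketch in R using lme4, on simulated data with hypothetical names (stories within newspapers). It uses a one-shot plug-in version of the residual-variance comparison, not the full Bayesian calculation from the paper with Pardoe:

```r
## Plug-in sketch of level-by-level R-squared with lme4, on simulated
## data (all names and numbers here are invented for illustration).
library(lme4)

set.seed(1)
J <- 114                                  # newspapers
n <- 6090                                 # stories
paper <- sample(J, n, replace = TRUE)     # which paper each story is in
x <- rnorm(n)                             # a standardized story-level predictor
z <- rnorm(J)                             # a standardized paper-level predictor
a <- 0.5 * z + rnorm(J, sd = 0.5)         # true paper intercepts
y <- a[paper] + 0.3 * x + rnorm(n)        # "story prominence"
d <- data.frame(y, x, paper = factor(paper), z = z[paper])

m <- lmer(y ~ x + z + (1 | paper), data = d)

## Story level: residual variance vs. variance of the reconstructed
## data (fitted values plus residuals recover y).
e1 <- residuals(m)
R2_story <- 1 - var(e1) / var(fitted(m) + e1)

## Paper level: spread of the estimated random intercepts (the level-2
## "errors") vs. spread of the level-2 predictor part plus those errors.
u <- ranef(m)$paper[, "(Intercept)"]
z_j <- tapply(d$z, d$paper, mean)         # paper-level predictor, by paper
pred_j <- fixef(m)["z"] * z_j             # level-2 regression predictor
R2_paper <- 1 - var(u) / var(pred_j + u)

c(story = R2_story, paper = R2_paper)
```

In the actual procedure the two variances are posterior expectations, averaged over simulations from the fitted model, so the point versions above should be read only as rough summaries.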
When in doubt, interpret coefficients by considering predictions with inputs set to various reasonable fixed values.
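For instance, continuing the hypothetical fit m above, one could compare population-level predictions for a few reference stories (again a sketch, with invented values):

```r
## Predicted prominence for stories at low/typical/high values of x,
## with the paper-level predictor z held at its mean (zero, since the
## inputs are standardized); re.form = NA drops the random effects,
## giving population-level predictions.
newstories <- data.frame(x = c(-1, 0, 1), z = 0)
predict(m, newdata = newstories, re.form = NA)
```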
If you have all the data loaded in, you should be able to use ownership group as a level and also include predictors at that level.
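One hypothetical version of this in lme4, grafting a simulated ownership level onto the example above (the owner assignments and the dichotomous owner trait are invented), also bears on QUESTION-3: the group intercepts and a group-level dummy predictor can appear in the same model.

```r
## Add ownership group as a third level: each paper belongs to one owner,
## and a dichotomous owner-level trait enters as a predictor alongside
## the owner intercepts. (The simulated data have no true owner effect,
## so the owner variance may be estimated near zero.)
K <- 59                                        # ownership groups
owner_of_paper <- sample(K, J, replace = TRUE) # owner of each paper
d$owner <- factor(owner_of_paper[paper])
d$public <- rbinom(K, 1, 0.5)[owner_of_paper][paper]  # owner-level dummy

m3 <- lmer(y ~ x + z + public + (1 | owner) + (1 | paper), data = d)
```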
I think this is reasonable, but I’m not following all the details. That’s one trick we use in our book on occasion.