
1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation


meta info for this blog

Source: html

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant predictors in the complete case sample, have very small coefficients.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. [sent-1, score-0.66]

2 I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. [sent-2, score-1.332]

3 I want only this variable to be imputed with as little bias as possible as I am not using any other variable. [sent-3, score-0.504]

4 So I scoured the survey for every variable I thought could possibly predict parent education. [sent-4, score-1.17]

5 The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. [sent-5, score-2.282]

6 I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. [sent-6, score-2.47]

7 My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant predictors in the complete case sample, have very small coefficients. [sent-7, score-1.75]

8 Is my method of including all the variables that were statistically significant predictors in the imputation model a valid strategy for deciding what to include in the imputation? [sent-9, score-1.014]

9 My reply: Your imputation plan seems reasonable. [sent-10, score-0.354]

10 To check it, you can do some cross-validation: randomly remove 1/5 (say) of the observations for your variable of interest, run the algorithm, then compare the held-out values to the random imputations. [sent-11, score-0.886]

11 We did some of this in our 1998 paper but I still haven’t gotten around to formalizing the method. [sent-12, score-0.182]

12 The cross-validation check won’t save you if you have serious nonignorable missingness (for example, large values more likely than small values to be misreported), but it can be thought of as a minimal check. [sent-13, score-0.903]
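
To make the suggested check concrete, here is a minimal sketch in Python. It assumes a pandas DataFrame whose columns are all numeric-coded, and it uses scikit-learn's IterativeImputer as a stand-in for ICE's iterative regression imputation; the column name parent_educ and the function cv_imputation_check are hypothetical, not part of the original exchange.

```python
# A minimal sketch of the cross-validation check described above. Assumes
# `df` is a pandas DataFrame whose columns are all numeric-coded; the
# column name 'parent_educ' is hypothetical. scikit-learn's IterativeImputer
# stands in for Stata ICE's iterative regression imputation.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def cv_imputation_check(df, target="parent_educ", holdout_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    work = df.copy()

    # Hold out 1/5 (say) of the *observed* values of the target variable.
    observed = work.index[work[target].notna()]
    held_out = rng.choice(observed, size=int(holdout_frac * len(observed)),
                          replace=False)
    truth = work.loc[held_out, target].copy()
    work.loc[held_out, target] = np.nan

    # Run the imputation algorithm on the masked data.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed = pd.DataFrame(imputer.fit_transform(work),
                           index=work.index, columns=work.columns)

    # Compare the held-out values to their imputations: RMSE and the
    # truth-imputation correlation are two simple summaries.
    pred = imputed.loc[held_out, target]
    rmse = float(np.sqrt(np.mean((pred - truth) ** 2)))
    corr = float(np.corrcoef(pred, truth)[0, 1])
    return rmse, corr
```

Repeating this over several random masks, and over competing predictor sets, gives a sense of which imputation model actually recovers the held-out values better; as noted above, it remains a minimal check and will not catch serious nonignorable missingness.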


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('parent', 0.585), ('imputation', 0.297), ('variable', 0.289), ('education', 0.2), ('values', 0.196), ('complete', 0.173), ('check', 0.141), ('variables', 0.13), ('explains', 0.128), ('survey', 0.125), ('misreported', 0.118), ('students', 0.118), ('predictors', 0.112), ('variance', 0.107), ('elena', 0.107), ('formalizing', 0.107), ('occupation', 0.103), ('ice', 0.103), ('missingness', 0.1), ('statistically', 0.098), ('found', 0.097), ('significant', 0.089), ('imputed', 0.088), ('include', 0.084), ('iterative', 0.084), ('deciding', 0.082), ('interest', 0.081), ('regression', 0.076), ('minimal', 0.075), ('stata', 0.075), ('gotten', 0.075), ('implemented', 0.074), ('using', 0.074), ('remove', 0.073), ('small', 0.07), ('randomly', 0.068), ('save', 0.066), ('observations', 0.065), ('valid', 0.064), ('algorithm', 0.062), ('predicting', 0.062), ('schools', 0.06), ('thought', 0.059), ('strategy', 0.058), ('possibly', 0.057), ('plan', 0.057), ('predict', 0.055), ('currently', 0.055), ('compare', 0.054), ('bias', 0.053)]
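
For readers curious how lists like the one above are produced, here is a minimal sketch, assuming scikit-learn; the actual maker-knowledge-mining pipeline is not shown here, so the vectorizer settings and the toy corpus below are assumptions for illustration only.

```python
# A minimal sketch of computing per-word tfidf weights and blog-to-blog
# similarities of the kind listed on this page. The corpus below is a
# hypothetical placeholder, not the real blog data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["full text of blog 1218 ...", "full text of another blog ..."]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)  # rows: blogs, columns: words

# Top-weighted words for the first blog (cf. the wordName/wordTfidf list).
weights = X[0].toarray().ravel()
terms = vec.get_feature_names_out()
top_words = sorted(zip(terms, weights), key=lambda t: -t[1])[:10]

# Pairwise similarities (cf. the simValue column in the lists below).
sims = cosine_similarity(X)
```

The lsi and lda lists further down would be computed analogously, with the term matrix first reduced by something like TruncatedSVD or LatentDirichletAllocation before taking similarities.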

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation


2 0.20282371 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

3 0.17933187 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

Introduction: Majid Ezzati writes: My research group is increasingly focusing on a series of problems that involve data that either have missingness or measurements that may have bias/error. We have at times developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model) and at other times, other groups impute the data. The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to suggest good references on whether having imputed X and/or Y in a subsequent regression is correct or if it could somehow lead to biased/spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not): 1) X and Y both measured (perhaps with error) 2) Y imputed using some data and a model and X measur

4 0.17825645 1460 andrew gelman stats-2012-08-16-“Real data can be a pain”

Introduction: Michael McLaughlin sent me the following query with the above title. Some time ago, I [McLaughlin] was handed a dataset that needed to be modeled. It was generated as follows: 1. Random navigation errors, historically a binary mixture of normal and Laplace with a common mean, were collected by observation. 2. Sadly, these data were recorded with too few decimal places so that the resulting quantization is clearly visible in a scatterplot. 3. The quantized data were then interpolated (to an unobserved location). The final result looks like fuzzy points (small scale jitter) at quantized intervals spanning a much larger scale (the parent mixture distribution). This fuzziness, likely ~normal or ~Laplace, results from the interpolation. Otherwise, the data would look like a discrete analogue of the normal/Laplace mixture. I would like to characterize the latent normal/Laplace mixture distribution but the quantization is “getting in the way”. When I tried MCMC on this proble

5 0.16145864 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

Introduction: Aureliano Crameri writes: I have questions regarding one technique you and your colleagues described in your papers: the cross validation (Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box, with reference to Gelman, King, and Liu, 1998). I think this is the technique I need for my purpose, but I am not sure I understand it right. I want to use the multiple imputation to estimate the outcome of psychotherapies based on longitudinal data. First I have to demonstrate that I am able to get unbiased estimates with the multiple imputation. The expected bias is the overestimation of the outcome of dropouts. I will test my imputation strategies by means of a series of simulations (delete values, impute, compare with the original). Due to the complexity of the statistical analyses I think I need at least 200 cases. Now I don’t have so many cases without any missings. My data have missing values in different variables. The proportion of missing values is

6 0.15966335 608 andrew gelman stats-2011-03-12-Single or multiple imputation?

7 0.12977213 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys

8 0.1297276 709 andrew gelman stats-2011-05-13-D. Kahneman serves up a wacky counterfactual

9 0.12671609 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys

10 0.12351073 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations

11 0.11730033 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

12 0.1159635 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

13 0.11578084 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

14 0.11396313 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

15 0.11250875 1248 andrew gelman stats-2012-04-06-17 groups, 6 group-level predictors: What to do?

16 0.11136523 1341 andrew gelman stats-2012-05-24-Question 14 of my final exam for Design and Analysis of Sample Surveys

17 0.10789187 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

18 0.10464972 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation

19 0.10193317 2296 andrew gelman stats-2014-04-19-Index or indicator variables

20 0.10187151 1472 andrew gelman stats-2012-08-28-Migrating from dot to underscore


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.146), (1, 0.079), (2, 0.105), (3, -0.079), (4, 0.132), (5, 0.091), (6, 0.003), (7, 0.005), (8, 0.04), (9, 0.052), (10, 0.064), (11, 0.003), (12, -0.012), (13, 0.009), (14, 0.032), (15, 0.015), (16, 0.016), (17, 0.0), (18, -0.014), (19, -0.005), (20, -0.011), (21, 0.077), (22, 0.017), (23, -0.021), (24, -0.001), (25, -0.005), (26, 0.04), (27, -0.055), (28, -0.01), (29, -0.037), (30, 0.065), (31, 0.057), (32, 0.043), (33, 0.085), (34, -0.034), (35, 0.023), (36, 0.028), (37, 0.06), (38, -0.052), (39, 0.01), (40, -0.051), (41, -0.022), (42, 0.035), (43, 0.027), (44, 0.016), (45, -0.004), (46, 0.004), (47, 0.014), (48, 0.011), (49, 0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98020154 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation


2 0.8577255 14 andrew gelman stats-2010-05-01-Imputing count data

Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address: How do you recommend treating count data (for example, # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens, several observations have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
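
As a concrete illustration of the y = u*v split described in that reply, here is a minimal sketch; the two regression models and every name below are assumptions for illustration, not the actual internals of the mi package.

```python
# A minimal sketch of imputing a skewed count variable via y = u * v:
# u is an indicator for a positive count (logistic regression), v is the
# positive part, modeled on the log scale. Assumes numeric, fully observed
# predictors in X; all names here are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

def impute_count(y, X, seed=0):
    """y: count Series with NaNs to fill; X: numeric predictor DataFrame."""
    rng = np.random.default_rng(seed)
    obs = y.notna()
    u = (y[obs] > 0).astype(int)  # indicator part of y = u * v

    # Part 1: logistic regression for whether the count is positive.
    pos_model = LogisticRegression(max_iter=1000).fit(X[obs], u)

    # Part 2: linear regression for log(v) among positive observations.
    pos = obs & (y > 0)
    log_model = LinearRegression().fit(X[pos], np.log(y[pos]))
    resid_sd = np.std(np.log(y[pos]) - log_model.predict(X[pos]))

    # Impute the missing entries: draw u, then draw v on the log scale.
    miss = ~obs
    p_pos = pos_model.predict_proba(X[miss])[:, 1]
    u_draw = rng.binomial(1, p_pos)
    v_draw = np.exp(log_model.predict(X[miss])
                    + rng.normal(0.0, resid_sd, size=int(miss.sum())))
    # Round to the nearest integer to avoid fractional counts.
    return pd.Series(np.rint(u_draw * v_draw), index=y.index[miss])
```

Drawing u and log(v) with residual noise, rather than plugging in fitted means, keeps the imputations from being artificially precise; the rounding at the end avoids fractional values, as the reply suggests.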

3 0.78577423 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance


4 0.77314413 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

Introduction: Brandon Behlendorf writes: I [Behlendorf] am replicating some previous research using OLS [he's talking about what we call "linear regression"---ed.] to regress a logged rate (to reduce skew) of Y on a number of predictors (Xs). Y is the count of a phenomena divided by the population of the unit of the analysis. The problem that I am encountering is that Y is composite count of a number of distinct phenomena [A+B+C], and these phenomena are not uniformly distributed across the sample. Most of the research in this area has conducted regressions either with Y or with individual phenomena [A or B or C] as the dependent variable. Yet it seems that if [A, B, C] are not uniformly distributed across the sample of units in the same proportion, then the use of Y would be biased, since as a count of [A+B+C] divided by the population, it would treat as equivalent units both [2+0.5+1.5] and [4+0+0]. My goal is trying to find a methodology which allows a researcher to regress Y on a

5 0.76119363 627 andrew gelman stats-2011-03-24-How few respondents are reasonable to use when calculating the average by county?

Introduction: Sam Stroope writes: I’m creating county-level averages based on individual-level respondents. My question is, how few respondents are reasonable to use when calculating the average by county? My end model will be a county-level (only) SEM model. My reply: Any number of respondents should work. If you have very few respondents, you should just end up with large standard errors which will propagate through your analysis. P.S. I must have deleted my original reply by accident so I reconstructed something above.

6 0.74959838 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

7 0.73810029 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

8 0.72281176 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

9 0.72043025 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

10 0.70496351 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

11 0.69305241 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making

12 0.68710238 1340 andrew gelman stats-2012-05-23-Question 13 of my final exam for Design and Analysis of Sample Surveys

13 0.68565357 251 andrew gelman stats-2010-09-02-Interactions of predictors in a causal model

14 0.68131846 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

15 0.67613441 2296 andrew gelman stats-2014-04-19-Index or indicator variables

16 0.67523217 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

17 0.6695593 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys

18 0.66858739 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

19 0.66127682 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

20 0.65869433 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(13, 0.011), (15, 0.052), (16, 0.123), (21, 0.018), (24, 0.171), (47, 0.091), (68, 0.015), (76, 0.012), (86, 0.032), (94, 0.031), (96, 0.011), (99, 0.31)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97788942 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation


2 0.97042662 1285 andrew gelman stats-2012-04-27-“How to Lie with Statistics” guy worked for the tobacco industry to mock studies of the risks of smoking

Introduction: Remember How to Lie With Statistics? It turns out that the author worked for the cigarette companies. John Mashey points to this, from Robert Proctor’s book, “Golden Holocaust: Origins of the Cigarette Catastrophe and the Case for Abolition”: Darrell Huff, author of the wildly popular (and aptly named) How to Lie With Statistics, was paid to testify before Congress in the 1950s and then again in the 1960s, with the assigned task of ridiculing any notion of a cigarette-disease link. On March 22, 1965, Huff testified at hearings on cigarette labeling and advertising, accusing the recent Surgeon General’s report of myriad failures and “fallacies.” Huff peppered his attack with amusing asides and anecdotes, lampooning spurious correlations like that between the size of Dutch families and the number of storks nesting on rooftops–which proves not that storks bring babies but rather that people with large families tend to have larger houses (which therefore attract more storks).

3 0.96981812 548 andrew gelman stats-2011-02-01-What goes around . . .

Introduction: A few weeks ago I delivered a 10-minute talk on statistical graphics that went so well, it was the best-received talk I’ve ever given. The crowd was raucous. Then some poor sap had to go on after me. He started by saying that my talk was a hard act to follow. And, indeed, the audience politely listened but did not really get involved in his presentation. Boy did I feel smug. More recently I gave a talk on Stan, at an entirely different venue. And this time the story was the exact opposite. Jim Demmel spoke first and gave a wonderful talk on optimization for linear algebra (it was an applied math conference). Then I followed, and I never really grabbed the crowd. My talk was not a disaster but it didn’t really work. This was particularly frustrating because I’m really excited about Stan and this was a group of researchers I wouldn’t usually have a chance to reach. It was the plenary session at the conference. Anyway, now I know how that guy felt from last month. My talk

4 0.96618736 2183 andrew gelman stats-2014-01-23-Discussion on preregistration of research studies

Introduction: Chris Chambers and I had an enlightening discussion the other day at the blog of Rolf Zwaan, regarding the Garden of Forking Paths (go here and scroll down through the comments). Chris sent me the following note: I’m writing a book at the moment about reforming practices in psychological research (focusing on various bad practices such as p-hacking, HARKing, low statistical power, publication bias, lack of data sharing etc. – and posing solutions such as pre-registration, Bayesian hypothesis testing, mandatory data archiving etc.) and I am arriving at a rather unsettling conclusion: that null hypothesis significance testing (NHST) simply isn’t valid for observational research. If this is true then most of the psychological literature is statistically flawed. I was wondering what your thoughts were on this, both from a statistical point of view and from your experience working in an observational field. We all know about the dangers of researcher degrees of freedom. We also know

5 0.96492678 1273 andrew gelman stats-2012-04-20-Proposals for alternative review systems for scientific work

Introduction: I recently became aware of two new entries in the ever-popular genre of Our Peer-Review System Is in Trouble; How Can We Fix It? Political scientist Brendan Nyhan, commenting on experimental and empirical sciences more generally, focuses on the selection problem that positive rather than negative findings tend to get published, leading via the statistical significance filter to an overestimation of effect sizes. Nyhan recommends that data-collection protocols be published ahead of time, with the commitment to publish the eventual results: In the case of experimental data, a better practice would be for journals to accept articles before the study was conducted. The article should be written up to the point of the results section, which would then be populated using a pre-specified analysis plan submitted by the author. The journal would then allow for post-hoc analysis and interpretation by the author that would be labeled as such and distinguished from the previously submit

6 0.96444142 1730 andrew gelman stats-2013-02-20-Unz on Unz

7 0.96339715 1261 andrew gelman stats-2012-04-12-The Naval Research Lab

8 0.96283102 95 andrew gelman stats-2010-06-17-“Rewarding Strivers: Helping Low-Income Students Succeed in College”

9 0.96016139 94 andrew gelman stats-2010-06-17-SAT stories

10 0.95777631 2270 andrew gelman stats-2014-03-28-Creating a Lenin-style democracy

11 0.95677793 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value

12 0.95587206 716 andrew gelman stats-2011-05-17-Is the internet causing half the rapes in Norway? I wanna see the scatterplot.

13 0.95521027 1055 andrew gelman stats-2011-12-13-Data sharing update

14 0.95514703 1450 andrew gelman stats-2012-08-08-My upcoming talk for the data visualization meetup

15 0.9549855 586 andrew gelman stats-2011-02-23-A statistical version of Arrow’s paradox

16 0.95482647 481 andrew gelman stats-2010-12-22-The Jumpstart financial literacy survey and the different purposes of tests

17 0.95476174 2137 andrew gelman stats-2013-12-17-Replication backlash

18 0.95472252 438 andrew gelman stats-2010-11-30-I just skyped in from Kentucky, and boy are my arms tired

19 0.95460439 2179 andrew gelman stats-2014-01-20-The AAA Tranche of Subprime Science

20 0.95408857 2227 andrew gelman stats-2014-02-27-“What Can we Learn from the Many Labs Replication Project?”