
608 andrew gelman stats-2011-03-12-Single or multiple imputation?


meta info for this blog post

Source: html

Introduction: Vishnu Ganglani writes: It appears that multiple imputation is the best way to impute missing data, because of its more accurate quantification of variance. However, when imputing missing income values in national household surveys, would you recommend maintaining the multiple datasets associated with multiple imputation, or would a single-imputation method suffice? I have worked on household survey projects (in Scotland) and in the past have gone with single methods for ease of implementation, but with the availability of open-source R software I am thinking of performing multiple imputation, though I am a bit apprehensive because of the complexity and the need to maintain multiple datasets. My reply: In many applications I’ve just used a single random imputation to avoid the awkwardness of working with multiple datasets. But if there’s any concern, I’d recommend doing parallel analyses on multiple imputed datasets and then combining inferences at the end.
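To make the reply concrete, here is a minimal sketch of the parallel-analyses-then-combine workflow in R. The post mentions open-source R but names no package, so the use of the mice package, and the variable names, are illustrative assumptions.

# Minimal sketch, assuming the 'mice' package and a data frame 'survey'
# with a partially missing income variable (illustrative names).
library(mice)

# Create m = 5 completed datasets by chained-equations imputation
imp <- mice(survey, m = 5, seed = 1)

# Run the same analysis on each completed dataset in parallel
fits <- with(imp, lm(log(income) ~ age + hhsize))

# Combine the five sets of estimates with Rubin's rules: the pooled estimate
# is the mean of the per-dataset estimates, and the pooled variance adds
# within-imputation and between-imputation components.
summary(pool(fits))

The maintenance burden the questioner worries about is modest in this setup: the m completed datasets live inside the single imp object rather than as separate files.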


Summary: the most important sentences, as generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Vishnu Ganglani writes: It appears that multiple imputation is the best way to impute missing data, because of its more accurate quantification of variance. [sent-1, score-1.518]

2 However, when imputing missing income values in national household surveys, would you recommend maintaining the multiple datasets associated with multiple imputation, or would a single-imputation method suffice? [sent-2, score-2.817]

3 My reply: In many applications I’ve just used a single random imputation to avoid the awkwardness of working with multiple datasets. [sent-4, score-1.294]

4 But if there’s any concern, I’d recommend doing parallel analyses on multiple imputed datasets and then combining inferences at the end. [sent-5, score-1.254]


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('multiple', 0.408), ('imputation', 0.384), ('datasets', 0.27), ('ease', 0.267), ('household', 0.207), ('maintain', 0.203), ('implementation', 0.182), ('single', 0.169), ('scotland', 0.153), ('apprehensive', 0.153), ('methodologies', 0.133), ('awkwardness', 0.133), ('imputations', 0.133), ('recommend', 0.133), ('appears', 0.132), ('quantification', 0.129), ('missing', 0.129), ('impute', 0.126), ('imputing', 0.118), ('imputed', 0.114), ('availability', 0.111), ('parallel', 0.098), ('complexity', 0.095), ('performing', 0.091), ('combining', 0.091), ('projects', 0.087), ('suggesting', 0.086), ('gone', 0.079), ('accurate', 0.078), ('concern', 0.078), ('surveys', 0.077), ('inferences', 0.074), ('software', 0.074), ('source', 0.073), ('applications', 0.072), ('practical', 0.071), ('avoid', 0.07), ('income', 0.068), ('associated', 0.068), ('analyses', 0.066), ('values', 0.063), ('worked', 0.062), ('national', 0.062), ('open', 0.06), ('random', 0.058), ('method', 0.056), ('past', 0.054), ('survey', 0.054), ('end', 0.052), ('however', 0.052)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 608 andrew gelman stats-2011-03-12-Single or multiple imputation?


2 0.31310293 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

Introduction: Aureliano Crameri writes: I have questions regarding one technique you and your colleagues described in your papers: cross-validation (“Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box,” with reference to Gelman, King, and Liu, 1998). I think this is the technique I need for my purpose, but I am not sure I understand it correctly. I want to use multiple imputation to estimate the outcome of psychotherapies based on longitudinal data. First I have to demonstrate that I am able to get unbiased estimates with multiple imputation; the expected bias is overestimation of the outcome for dropouts. I will test my imputation strategies by means of a series of simulations (delete values, impute, compare with the original). Due to the complexity of the statistical analyses I think I need at least 200 cases, but I don’t have that many cases without any missings: my data have missing values in different variables. The proportion of missing values is
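The delete-impute-compare check Crameri describes can be scripted directly. A rough sketch, assuming a complete-case data frame complete_cases with a numeric variable y, and using mean imputation only as a stand-in for whatever strategy is under test:

# Mask known values, impute them, and score the imputations against the truth.
set.seed(1)
mask <- sample(nrow(complete_cases), 50)
truth <- complete_cases$y[mask]

test <- complete_cases
test$y[mask] <- NA

# Stand-in imputation; replace with the strategy being evaluated
test$y[mask] <- mean(test$y, na.rm = TRUE)

# Root-mean-squared error of the imputed values against the held-out truth
sqrt(mean((test$y[mask] - truth)^2))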

3 0.24531794 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys

Introduction: 15. A researcher conducts a random-digit-dial survey of individuals and married couples. The design is as follows: if only one person lives in a household, he or she is interviewed. If there are multiple adults in the household, one is selected at random; he or she is interviewed and, if he or she is married to one of the other adults in the household, the spouse is interviewed as well. Come up with a scheme for inverse-probability weights (ignoring nonresponse and assuming there is exactly one phone line per household). Solution to question 14, from yesterday: 14. A public health survey of elderly Americans includes many questions, including “How many hours per week did you exercise in your most active years as a young adult?” and also several questions about current mobility and health status. Response rates are high for the questions about recent activities and status, but there is a lot of nonresponse for the question on past activity. You are considering imputing the mis
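The post leaves question 15 unanswered, but the reasoning can be sketched (this is my illustration, not the posted solution). Conditional on the household being reached, an unmarried adult in a household with A adults is interviewed with probability 1/A, while a married adult whose spouse lives there is interviewed with probability 2/A (selected directly or via the spouse); the inverse-probability weight inverts these.

# Hypothetical sketch, not the exam's official solution: inverse-probability
# weight given the number of adults and whether the spouse is co-resident.
ipw <- function(n_adults, spouse_in_household) {
  p <- ifelse(spouse_in_household, 2 / n_adults, 1 / n_adults)
  1 / p
}

ipw(1, FALSE)  # single adult: weight 1
ipw(3, FALSE)  # unmarried adult among 3: weight 3
ipw(3, TRUE)   # married adult among 3: weight 1.5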

4 0.22508955 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

Introduction: Majid Ezzati writes: My research group is increasingly focusing on a series of problems that involve data that either have missingness or measurements that may have bias/error. We have at times developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model) and at other times, other groups impute the data. The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to suggest good references on whether having imputed X and/or Y in a subsequent regression is correct or if it could somehow lead to biased/spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not): 1) X and Y both measured (perhaps with error) 2) Y imputed using some data and a model and X measur
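One of the situations Ezzati lists is easy to probe by simulation. The sketch below is my own illustrative example, not anything from the post; it shows that deterministically mean-imputing a missing outcome Y attenuates the estimated slope toward zero by roughly the fraction of Y that is missing:

# Illustrative simulation: mean-imputing missing Y biases the slope toward 0.
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

y_obs <- y
y_obs[sample(n, n / 2)] <- NA                  # half of Y missing at random
y_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

coef(lm(y ~ x))[2]      # close to the true slope, 2
coef(lm(y_imp ~ x))[2]  # attenuated, roughly 1 with half of Y imputed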

5 0.2013564 1989 andrew gelman stats-2013-08-20-Correcting for multiple comparisons in a Bayesian regression model

Introduction: Joe Northrup writes: I have a question about correcting for multiple comparisons in a Bayesian regression model. I believe I understand the argument in your 2012 paper in Journal of Research on Educational Effectiveness that when you have a hierarchical model there is shrinkage of estimates towards the group-level mean and thus there is no need to add any additional penalty to correct for multiple comparisons. In my case I do not have hierarchically structured data—i.e. I have only 1 observation per group but have a categorical variable with a large number of categories. Thus, I am fitting a simple multiple regression in a Bayesian framework. Would putting a strong, mean 0, multivariate normal prior on the betas in this model accomplish the same sort of shrinkage (it seems to me that it would) and do you believe this is a valid way to address criticism of multiple comparisons in this setting? My reply: Yes, I think this makes sense. One way to address concerns of multiple com
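A sketch of the kind of model Northrup describes, using the rstanarm package (my choice of software; the post names none), with a common mean-zero normal prior on all the category coefficients:

# Sketch, assuming a data frame 'd' with outcome y and a many-level factor
# 'group' (illustrative names). The mean-zero prior shrinks all category
# effects toward zero, which is the multiple-comparisons protection at issue.
library(rstanarm)

fit <- stan_glm(y ~ group, data = d,
                prior = normal(location = 0, scale = 1),
                seed = 1)
summary(fit)

The hierarchical approach from the 2012 paper would instead estimate the amount of shrinkage from the data, e.g. stan_lmer(y ~ (1 | group), data = d).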

6 0.19152057 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

7 0.18639579 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations

8 0.17519219 1016 andrew gelman stats-2011-11-17-I got 99 comparisons but multiplicity ain’t one

9 0.15966335 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

10 0.13895985 1341 andrew gelman stats-2012-05-24-Question 14 of my final exam for Design and Analysis of Sample Surveys

11 0.13691241 1345 andrew gelman stats-2012-05-26-Question 16 of my final exam for Design and Analysis of Sample Surveys

12 0.13204417 1642 andrew gelman stats-2012-12-28-New book by Stef van Buuren on missing-data imputation looks really good!

13 0.11489321 404 andrew gelman stats-2010-11-09-“Much of the recent reported drop in interstate migration is a statistical artifact”

14 0.11390574 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys

15 0.1106246 2260 andrew gelman stats-2014-03-22-Postdoc at Rennes on multilevel missing data imputation

16 0.10951585 524 andrew gelman stats-2011-01-19-Data exploration and multiple comparisons

17 0.10402176 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation

18 0.098557457 1535 andrew gelman stats-2012-10-16-Bayesian analogue to stepwise regression?

19 0.098295383 848 andrew gelman stats-2011-08-11-That xkcd cartoon on multiple comparisons that all of you were sending me a couple months ago

20 0.095854342 1870 andrew gelman stats-2013-05-26-How to understand coefficients that reverse sign when you start controlling for things?


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.101), (1, 0.04), (2, 0.033), (3, -0.041), (4, 0.062), (5, 0.04), (6, -0.05), (7, -0.019), (8, 0.014), (9, 0.006), (10, 0.017), (11, -0.034), (12, 0.004), (13, 0.033), (14, 0.021), (15, 0.022), (16, -0.006), (17, -0.028), (18, 0.005), (19, -0.018), (20, 0.017), (21, 0.075), (22, 0.009), (23, 0.034), (24, -0.033), (25, -0.004), (26, 0.007), (27, -0.017), (28, 0.098), (29, 0.009), (30, 0.042), (31, -0.009), (32, 0.085), (33, 0.12), (34, -0.017), (35, -0.013), (36, 0.072), (37, 0.109), (38, -0.013), (39, 0.025), (40, -0.089), (41, 0.043), (42, 0.017), (43, -0.01), (44, -0.02), (45, -0.028), (46, -0.014), (47, 0.021), (48, -0.019), (49, -0.006)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9703142 608 andrew gelman stats-2011-03-12-Single or multiple imputation?


2 0.75498855 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation


3 0.64829743 14 andrew gelman stats-2010-05-01-Imputing count data

Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address: How do you recommend treating count data (for example, # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens, several obs have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
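A rough sketch of that two-part treatment in base R follows; the variable names are illustrative, and this is a hand-rolled version, not the mi package's internal code.

# Two-part model for a skewed count y = u * v: u = 1(y > 0) fit by logistic
# regression, v = y given y > 0 modeled on the log scale.
# 'd' holds the count y (e.g. number of cows) and predictors x1, x2.
fit_u <- glm(I(y > 0) ~ x1 + x2, family = binomial, data = d)
fit_v <- lm(log(y) ~ x1 + x2, data = d, subset = y > 0)

# Impute a missing count for a new row nd: draw the indicator, then the
# positive part, then round to avoid fractional values.
p <- predict(fit_u, newdata = nd, type = "response")
u_draw <- rbinom(1, 1, p)
v_draw <- exp(rnorm(1, mean = predict(fit_v, newdata = nd), sd = sigma(fit_v)))
round(u_draw * v_draw)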

4 0.61230612 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools, and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible, as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant
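Grewal's screening step, checking how much of the variance in parent education a candidate predictor explains among complete cases, is straightforward to reproduce; a sketch with illustrative variable names:

# Variance in parent education explained by parent occupation, computed on
# cases complete for both variables (she reports roughly 35%).
cc <- complete.cases(d[, c("parent_ed", "parent_occ")])
summary(lm(parent_ed ~ parent_occ, data = d[cc, ]))$r.squared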

5 0.58099604 404 andrew gelman stats-2010-11-09-“Much of the recent reported drop in interstate migration is a statistical artifact”

Introduction: Greg Kaplan writes: I noticed that you have blogged a little about interstate migration trends in the US, and thought that you might be interested in a new working paper of mine (joint with Sam Schulhofer-Wohl from the Minneapolis Fed) which I have attached. Briefly, we show that much of the recent reported drop in interstate migration is a statistical artifact: The Census Bureau made an undocumented change in its imputation procedures for missing data in 2006, and this change significantly reduced the number of imputed interstate moves. The change in imputation procedures — not any actual change in migration behavior — explains 90 percent of the reported decrease in interstate migration between the 2005 and 2006 Current Population Surveys, and 42 percent of the decrease between 2000 and 2010. I haven’t had a chance to give it a serious look, so I could only make the quick suggestion to make the graphs smaller and put multiple graphs on a page. This would allow the reader to bett

6 0.579952 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

7 0.57993913 580 andrew gelman stats-2011-02-19-Weather visualization with WeatherSpark

8 0.57256103 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations

9 0.56180096 1978 andrew gelman stats-2013-08-12-Fixing the race, ethnicity, and national origin questions on the U.S. Census

10 0.55849987 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

11 0.55445892 1870 andrew gelman stats-2013-05-26-How to understand coefficients that reverse sign when you start controlling for things?

12 0.53619242 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?

13 0.53260392 1691 andrew gelman stats-2013-01-25-Extreem p-values!

14 0.52978456 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

15 0.51011944 527 andrew gelman stats-2011-01-20-Cars vs. trucks

16 0.50932497 1016 andrew gelman stats-2011-11-17-I got 99 comparisons but multiplicity ain’t one

17 0.50919408 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

18 0.50880659 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys

19 0.50579894 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys

20 0.50039458 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(15, 0.023), (16, 0.251), (20, 0.019), (24, 0.122), (29, 0.062), (76, 0.175), (86, 0.011), (89, 0.014), (99, 0.198)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93538964 608 andrew gelman stats-2011-03-12-Single or multiple imputation?


2 0.86771899 1366 andrew gelman stats-2012-06-05-How do segregation measures change when you change the level of aggregation?

Introduction: In a discussion of workplace segregation, Philip Cohen posts some graphs that led me to a statistical question. I’ll pose my question below, but first the graphs: In a world of zero segregation of jobs by sex, the top graph above would have a spike at 50% (or, whatever the actual percentage is of women in the labor force) and, in the bottom graph, the pink and blue lines would be in the same place and would look like very steep S curves. The difference between the pink and blue lines represents segregation by job. One thing I wonder is how these graphs would change if we redefine occupation. (For example, is my occupation “mathematical scientist,” “statistician,” “teacher,” “university professor,” “statistics professor,” or “tenured statistics professor”?) Finer or coarser classification would give different results, and I wonder how this would work. This is not at all meant as a criticism of Cohen’s claims, it’s just a statistical question. I’m guessing that
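Cohen's aggregation question can be made concrete with the standard index of dissimilarity, D = (1/2) * sum_j | m_j/M - f_j/F | over occupations j. The sketch below is my construction, not from the post; note that merging categories can only leave D the same or smaller, by the triangle inequality.

# Index of dissimilarity at two levels of occupational aggregation.
# 'jobs' holds counts of men and women by occupation; 'occ_fine' nests
# within 'occ_coarse' (illustrative names).
dissimilarity <- function(men, women) {
  0.5 * sum(abs(men / sum(men) - women / sum(women)))
}

agg_fine   <- aggregate(cbind(men, women) ~ occ_fine,   data = jobs, FUN = sum)
agg_coarse <- aggregate(cbind(men, women) ~ occ_coarse, data = jobs, FUN = sum)

dissimilarity(agg_fine$men,   agg_fine$women)    # finer classification
dissimilarity(agg_coarse$men, agg_coarse$women)  # coarser: never larger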

3 0.86502802 2 andrew gelman stats-2010-04-23-Modeling heterogenous treatment effects

Introduction: Don Green and Holger Kern write on one of my favorite topics , treatment interactions (see also here ): We [Green and Kern] present a methodology that largely automates the search for systematic treatment effect heterogeneity in large-scale experiments. We introduce a nonparametric estimator developed in statistical learning, Bayesian Additive Regression Trees (BART), to model treatment effects that vary as a function of covariates. BART has several advantages over commonly employed parametric modeling strategies, in particular its ability to automatically detect and model relevant treatment-covariate interactions in a flexible manner. To increase the reliability and credibility of the resulting conditional treatment effect estimates, we suggest the use of a split sample design. The data are randomly divided into two equally-sized parts, with the first part used to explore treatment effect heterogeneity and the second part used to confirm the results. This approach permits a re
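A rough sketch of the split-sample BART workflow Green and Kern describe, using the dbarts package; the package choice and all variable names are my assumptions, and the confirmation step on the second half is only indicated, not carried out.

# Fit BART on one random half, then inspect predicted treatment effects on
# the other half. X is a covariate matrix, z a 0/1 treatment, y the outcome.
library(dbarts)

n <- nrow(X)
explore <- sample(n, floor(n / 2))

fit <- bart(x.train = cbind(X[explore, ], z = z[explore]),
            y.train = y[explore],
            x.test  = rbind(cbind(X[-explore, ], z = 1),
                            cbind(X[-explore, ], z = 0)))

# Posterior-mean individual treatment effects on the held-out half
m <- n - length(explore)
tau_hat <- colMeans(fit$yhat.test[, 1:m] - fit$yhat.test[, (m + 1):(2 * m)])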

4 0.8640182 177 andrew gelman stats-2010-08-02-Reintegrating rebels into civilian life: Quasi-experimental evidence from Burundi

Introduction: Michael Gilligan, Eric Mvukiyehe, and Cyrus Samii write : We [Gilligan, Mvukiyehe, and Samii] use original survey data, collected in Burundi in the summer of 2007, to show that a World Bank ex-combatant reintegration program implemented after Burundi’s civil war caused significant economic reintegration for its beneficiaries, but that this economic reintegration did not translate into greater political and social reintegration. Previous studies of reintegration programs have found them to be ineffective, but these studies have suffered from selection bias: only ex-combatants who self-selected into those programs were studied. We avoid such bias with a quasi-experimental research design made possible by an exogenous bureaucratic failure in the implementation of the program. One of the World Bank’s implementing partners delayed implementation by almost a year due to an unforeseen contract dispute. As a result, roughly a third of ex-combatants had their program benefits withheld for reas

5 0.86098313 1025 andrew gelman stats-2011-11-24-Always check your evidence

Introduction: Logical reasoning typically takes the following form: 1. I know that A is true. 2. I know that A implies B. 3. Therefore, I can conclude that B is true. I, like Lewis Carroll, have problems with this process sometimes, but it’s pretty standard. There is also a statistical version in which the above statements are replaced by averages (“A usually happens,” etc.). But in all these stories, the argument can fall down if you get the facts wrong. Perhaps that’s one reason that statisticians can be obsessed with detail. For example, David Brooks wrote the following, in a column called “Living with Mistakes”: The historian Leslie Hannah identified the ten largest American companies in 1912. None of those companies ranked in the top 100 companies by 1990. Huh? Could that really be? I googled “ten largest american companies 1912” and found this, from Leslie Hannah: No big deal: two still in the top 10 rather than zero in the top 100, but Brooks’s general

6 0.85803306 1279 andrew gelman stats-2012-04-24-ESPN is looking to hire a research analyst

7 0.85744655 700 andrew gelman stats-2011-05-06-Suspicious pattern of too-strong replications of medical research

8 0.85410255 609 andrew gelman stats-2011-03-13-Coauthorship norms

9 0.85375804 1598 andrew gelman stats-2012-11-30-A graphics talk with no visuals!

10 0.85338688 1156 andrew gelman stats-2012-02-06-Bayesian model-building by pure thought: Some principles and examples

11 0.85056001 1487 andrew gelman stats-2012-09-08-Animated drought maps

12 0.84992838 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

13 0.84467292 1093 andrew gelman stats-2011-12-30-Strings Attached: Untangling the Ethics of Incentives

14 0.84298921 411 andrew gelman stats-2010-11-13-Ethical concerns in medical trials

15 0.83820504 321 andrew gelman stats-2010-10-05-Racism!

16 0.83533102 445 andrew gelman stats-2010-12-03-Getting a job in pro sports… as a statistician

17 0.83274925 1928 andrew gelman stats-2013-07-06-How to think about papers published in low-grade journals?

18 0.83169526 960 andrew gelman stats-2011-10-15-The bias-variance tradeoff

19 0.83162022 377 andrew gelman stats-2010-10-28-The incoming moderate Republican congressmembers

20 0.82869864 1180 andrew gelman stats-2012-02-22-I’m officially no longer a “rogue”