andrew_gelman_stats-2012-1330 knowledge-graph by maker-knowledge-mining

1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation


meta info for this blog

Source: html

Introduction: Aureliano Crameri writes: I have questions regarding one technique you and your colleagues described in your papers: the cross validation (Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box, with reference to Gelman, King, and Liu, 1998). I think this is the technique I need for my purpose, but I am not sure I understand it right. I want to use multiple imputation to estimate the outcome of psychotherapies based on longitudinal data. First I have to demonstrate that I am able to get unbiased estimates with the multiple imputation. The expected bias is the overestimation of the outcome of dropouts. I will test my imputation strategies by means of a series of simulations (delete values, impute, compare with the original). Due to the complexity of the statistical analyses I think I need at least 200 cases. Now I don't have that many cases without any missing values. My data have missing values in different variables. The proportion of missing values is lower than 30%.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Aureliano Crameri writes: I have questions regarding one technique you and your colleagues described in your papers: the cross validation (Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box, with reference to Gelman, King, and Liu, 1998). [sent-1, score-0.496]

2 I think this is the technique I need for my purpose, but I am not sure I understand it right. [sent-2, score-0.154]

3 I want to use the multiple imputation to estimate the outcome of psychotherapies based on longitudinal data. [sent-3, score-0.685]

4 First I have to demonstrate that I am able to get unbiased estimates with the multiple imputation. [sent-4, score-0.213]

5 The expected bias is the overestimation of the outcome of dropouts. [sent-5, score-0.259]

6 I will test my imputation strategies by means of a series of simulations (delete values, impute, compare with the original). [sent-6, score-0.64]

7 Due to the complexity of the statistical analyses I think I need at least 200 cases. [sent-7, score-0.073]

8 The proportion of missing values is lower than 30%. [sent-10, score-0.361]

9 I set up the chained equations and generate 30-40 imputations. [sent-12, score-0.535]

10 Next I fill in the missing values with the mean of the imputed values. [sent-13, score-0.411]

11 Among others I delete the outcome of 30 successful and 30 unsuccessful cases. [sent-15, score-0.703]

12 Then I impute again with the same regression equations used before. [sent-16, score-0.438]

13 Finally I pool the results according to Rubin's rules and I compare the pooled coefficients with the coefficients generated with the repaired dataset. [sent-17, score-0.836]

14 So I can test if the chained equations can discriminate between successful and unsuccessful courses of therapy. [sent-18, score-1.051]

15 Does such a procedure correspond to cross-validation in your sense? [sent-19, score-0.531]

16 When you do your first imputation, don’t fill in missing entries with the mean of the imputed values. [sent-22, score-0.63]

17 Just use one of the completed datasets (with random imputations) that you created. [sent-23, score-0.239]

18 You can do the whole thing 10 times with different random imputations, but that probably won’t make a difference. [sent-24, score-0.09]
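The delete-impute-compare loop described above can be sketched numerically. Below is a minimal Python illustration (the post's context is R's mi package; the simulated data and all variable names here are hypothetical), using a single regression imputation with a random draw, per the advice above, rather than the mean of the imputed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small complete dataset in which x predicts y.
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# Cross-validation check: delete known y values, impute them, compare.
mask = rng.random(n) < 0.3          # ~30% "missing", as in the question
y_obs = y.copy()
y_obs[mask] = np.nan

# One random regression imputation: fit y ~ x on the observed cases,
# then impute each missing y as prediction + a residual-scale draw
# (a draw, not the mean of draws).
b1, b0 = np.polyfit(x[~mask], y_obs[~mask], 1)
resid_sd = np.std(y_obs[~mask] - (b0 + b1 * x[~mask]))
y_imp = y_obs.copy()
y_imp[mask] = b0 + b1 * x[mask] + rng.normal(scale=resid_sd, size=mask.sum())

# Compare the imputations against the held-out truth.
rmse = np.sqrt(np.mean((y_imp[mask] - y[mask]) ** 2))
bias = np.mean(y_imp[mask] - y[mask])
print(f"RMSE: {rmse:.2f}, bias: {bias:.2f}")
```

In a real check one would repeat the delete-impute step (as the question proposes, e.g. with chained equations and 30-40 imputations) and pool before comparing; the bias statistic here is what would reveal systematic overestimation for dropouts.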


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('imputation', 0.298), ('equations', 0.243), ('repaired', 0.237), ('chained', 0.224), ('unsuccessful', 0.207), ('imputations', 0.207), ('delete', 0.207), ('impute', 0.195), ('validation', 0.18), ('imputed', 0.177), ('outcome', 0.166), ('fill', 0.166), ('cross', 0.162), ('technique', 0.154), ('missing', 0.15), ('values', 0.148), ('multiple', 0.136), ('successful', 0.123), ('coefficients', 0.12), ('proceeding', 0.112), ('compare', 0.109), ('mi', 0.095), ('discriminate', 0.093), ('pooled', 0.093), ('overestimation', 0.093), ('diagnostics', 0.092), ('random', 0.09), ('test', 0.088), ('longitudinal', 0.085), ('proceed', 0.084), ('liu', 0.084), ('opening', 0.084), ('windows', 0.083), ('generated', 0.08), ('completed', 0.079), ('strategies', 0.077), ('pool', 0.077), ('correspond', 0.077), ('unbiased', 0.077), ('complexity', 0.073), ('courses', 0.073), ('king', 0.072), ('box', 0.07), ('datasets', 0.07), ('entries', 0.069), ('simulations', 0.068), ('generate', 0.068), ('mean', 0.068), ('black', 0.064), ('proportion', 0.063)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation


2 0.31310293 608 andrew gelman stats-2011-03-12-Single or multiple imputation?

Introduction: Vishnu Ganglani writes: It appears that multiple imputation is the best way to impute missing data because of the more accurate quantification of variance. However, when imputing missing data for income values in national household surveys, would you recommend maintaining the multiple datasets associated with multiple imputations, or would a single imputation method suffice? I have worked on household survey projects (in Scotland) and in the past gone with suggesting single methods for ease of implementation, but with the availability of open source R software I am thinking of performing multiple imputation methodologies, but a bit apprehensive because of the complexity and also the need to maintain multiple datasets (ease of implementation). My reply: In many applications I’ve just used a single random imputation to avoid the awkwardness of working with multiple datasets. But if there’s any concern, I’d recommend doing parallel analyses on multipl

3 0.24751659 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations

Introduction: Vincent Yip writes: I have read your paper [with Kobi Abayomi and Marc Levy] regarding multiple imputation application. To diagnose my imputed data, I used Kolmogorov-Smirnov (K-S) tests to compare the distribution differences between the imputed and observed values of a single attribute as mentioned in your paper. My question is: For example I have this attribute X with the following data: (NA = missing) Original dataset: 1, NA, 3, 4, 1, 5, NA Imputed dataset: 1, 2, 3, 4, 1, 5, 6 a) in order to run the K-S test, will I treat the observed data as 1, 3, 4, 1, 5? b) and for the observed data, will I treat 1, 2, 3, 4, 1, 5, 6 as the imputed dataset for the K-S test? or just 2, 6? c) if I used m=5, I will have 5 sets of imputed data. How would I apply the K-S test to the 5 of them and compare to the single observed distribution? Do I combine the 5 imputed data sets into one by averaging each imputed value so I get one single imputed dataset and compare with the ob

4 0.24493824 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?

Introduction: Majid Ezzati writes: My research group is increasingly focusing on a series of problems that involve data that either have missingness or measurements that may have bias/error. We have at times developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model) and at other times, other groups impute the data. The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to suggest good references on whether having imputed X and/or Y in a subsequent regression is correct or if it could somehow lead to biased/spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not): 1) X and Y both measured (perhaps with error) 2) Y imputed using some data and a model and X measur

5 0.18339148 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys

Introduction: 15. A researcher conducts a random-digit-dial survey of individuals and married couples. The design is as follows: if only one person lives in a household, he or she is interviewed. If there are multiple adults in the household, one is selected at random: he or she is interviewed and, if he or she is married to one of the other adults in the household, the spouse is interviewed as well. Come up with a scheme for inverse-probability weights (ignoring nonresponse and assuming there is exactly one phone line per household). Solution to question 14 From yesterday : 14. A public health survey of elderly Americans includes many questions, including “How many hours per week did you exercise in your most active years as a young adult?” and also several questions about current mobility and health status. Response rates are high for the questions about recent activities and status, but there is a lot of nonresponse for the question on past activity. You are considering imputing the mis

6 0.16145864 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

7 0.13962997 1019 andrew gelman stats-2011-11-19-Validation of Software for Bayesian Models Using Posterior Quantiles

8 0.13767579 704 andrew gelman stats-2011-05-10-Multiple imputation and multilevel analysis

9 0.13596541 1341 andrew gelman stats-2012-05-24-Question 14 of my final exam for Design and Analysis of Sample Surveys

10 0.13386521 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

11 0.10362935 817 andrew gelman stats-2011-07-23-New blog home

12 0.098622531 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys

13 0.096678048 2170 andrew gelman stats-2014-01-13-Judea Pearl overview on causal inference, and more general thoughts on the reexpression of existing methods by considering their implicit assumptions

14 0.093987599 404 andrew gelman stats-2010-11-09-“Much of the recent reported drop in interstate migration is a statistical artifact”

15 0.093622833 2260 andrew gelman stats-2014-03-22-Postdoc at Rennes on multilevel missing data imputation

16 0.090792432 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation

17 0.086781397 537 andrew gelman stats-2011-01-25-Postdoc Position #1: Missing-Data Imputation, Diagnostics, and Applications

18 0.08589749 1642 andrew gelman stats-2012-12-28-New book by Stef van Buuren on missing-data imputation looks really good!

19 0.085052349 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

20 0.083828844 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.128), (1, 0.048), (2, 0.049), (3, -0.054), (4, 0.065), (5, 0.028), (6, -0.008), (7, -0.026), (8, 0.042), (9, 0.008), (10, 0.013), (11, -0.001), (12, 0.018), (13, -0.006), (14, 0.016), (15, 0.03), (16, -0.012), (17, -0.018), (18, 0.002), (19, -0.013), (20, 0.008), (21, 0.058), (22, 0.025), (23, 0.012), (24, 0.032), (25, 0.004), (26, 0.0), (27, -0.038), (28, 0.059), (29, 0.004), (30, 0.071), (31, -0.011), (32, 0.053), (33, 0.11), (34, 0.002), (35, -0.014), (36, 0.04), (37, 0.081), (38, 0.023), (39, -0.014), (40, -0.07), (41, -0.023), (42, -0.005), (43, -0.006), (44, -0.029), (45, 0.018), (46, 0.038), (47, 0.009), (48, 0.012), (49, 0.011)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95316607 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation


2 0.80648309 608 andrew gelman stats-2011-03-12-Single or multiple imputation?


3 0.77481884 14 andrew gelman stats-2010-05-01-Imputing count data

Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address: How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
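The y = u*v split described in that reply can be illustrated in a few lines. This is a hedged numpy sketch of the decomposition and a single random draw for the positive part, not the mi package's actual implementation; the simulated counts are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# A skewed count variable with many zeros (e.g. number of cows).
y = rng.poisson(lam=rng.choice([0.0, 4.0], size=500))

# Split y = u * v: u indicates positivity, v is the positive part.
u = (y > 0).astype(int)   # u would be modeled with logistic regression
log_v = np.log(y[y > 0])  # v is modeled on the log scale

# A random imputation for one missing count: draw log(v) from its
# fitted distribution (here a crude normal approximation), then
# exponentiate and round to the nearest integer.
draw = np.exp(rng.normal(log_v.mean(), log_v.std()))
imputed = int(round(draw))
print(imputed)
```

Modeling log(v) sidesteps the log-of-zero problem because the zeros are handled entirely by the binary part u.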

4 0.71473426 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant

5 0.69608414 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations


6 0.68914574 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

7 0.67336017 527 andrew gelman stats-2011-01-20-Cars vs. trucks

8 0.66125679 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

9 0.64206845 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

10 0.64005136 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

11 0.63935983 1344 andrew gelman stats-2012-05-25-Question 15 of my final exam for Design and Analysis of Sample Surveys

12 0.63728011 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

13 0.62866849 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

14 0.62773848 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

15 0.60566396 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

16 0.60507113 777 andrew gelman stats-2011-06-23-Combining survey data obtained using different modes of sampling

17 0.59969842 1870 andrew gelman stats-2013-05-26-How to understand coefficients that reverse sign when you start controlling for things?

18 0.59835243 569 andrew gelman stats-2011-02-12-Get the Data

19 0.59812474 212 andrew gelman stats-2010-08-17-Futures contracts, Granger causality, and my preference for estimation to testing

20 0.59608614 627 andrew gelman stats-2011-03-24-How few respondents are reasonable to use when calculating the average by county?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(13, 0.033), (16, 0.486), (21, 0.024), (24, 0.037), (29, 0.022), (57, 0.017), (72, 0.016), (86, 0.028), (89, 0.018), (94, 0.011), (99, 0.207)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.99048811 1014 andrew gelman stats-2011-11-16-Visualizations of NYPD stop-and-frisk data

Introduction: Cathy O’Neil organized this visualization project with NYPD stop-and-frisk data. It’s part of the Data Without Borders project. Unfortunately, because of legal restrictions I couldn’t send them the data Jeff, Alex, and I used in our project several years ago.

2 0.98765254 572 andrew gelman stats-2011-02-14-Desecration of valuable real estate

Introduction: Malecki asks: Is this the worst infographic ever to appear in NYT? USA Today is not something to aspire to. To connect to some of our recent themes , I agree this is a pretty horrible data display. But it’s not bad as a series of images. Considering the competition to be a cartoon or series of photos, these images aren’t so bad. One issue, I think, is that designers get credit for creativity and originality (unusual color combinations! Histogram bars shaped like mosques!) , which is often the opposite of what we want in a clear graph. It’s Martin Amis vs. George Orwell all over again.

3 0.98616213 1115 andrew gelman stats-2012-01-12-Where are the larger-than-life athletes?

Introduction: Jonathan Cantor points to this poll estimating rifle-armed QB Tim Tebow as America’s favorite pro athlete: In an ESPN survey of 1,502 Americans age 12 or older, three percent identified Tebow as their favorite professional athlete. Tebow finished in front of Kobe Bryant (2 percent), Aaron Rodgers (1.9 percent), Peyton Manning (1.8 percent), and Tom Brady (1.5 percent). Amusing. What this survey says to me is that there are no super-popular athletes who are active in America today. Which actually sounds about right. No Tiger Woods, no Magic Johnson, Muhammed Ali, John Elway, Pete Rose, Billie Jean King, etc etc. Tebow is an amusing choice, people might as well pick him now while he’s still on top. As a sports celeb, he’s like Bill Lee or the Refrigerator: colorful and a solid pro athlete, but no superstar. When you think about all the colorful superstar athletes of times gone by, it’s perhaps surprising that there’s nobody out there right now to play the role. I supp

4 0.9829337 528 andrew gelman stats-2011-01-21-Elevator shame is a two-way street

Introduction: Tyler Cowen links a blog by Samuel Arbesman mocking people who are so lazy that they take the elevator from 1 to 2. This reminds me of my own annoyance about a guy who worked in my building and did not take the elevator. (For the full story, go here and search on “elevator.”)

5 0.98009008 1659 andrew gelman stats-2013-01-07-Some silly things you (didn’t) miss by not reading the sister blog

Introduction: 1. I have the least stressful job in America (duh) 2. B-school prof in a parody of short-term thinking 3. The academic clock 4. I guessed wrong 5. 2012 Conceptual Development Lab Newsletter

6 0.96814525 1304 andrew gelman stats-2012-05-06-Picking on Stephen Wolfram

7 0.9594903 1180 andrew gelman stats-2012-02-22-I’m officially no longer a “rogue”

8 0.95754617 1279 andrew gelman stats-2012-04-24-ESPN is looking to hire a research analyst

9 0.95724493 1026 andrew gelman stats-2011-11-25-Bayes wikipedia update

10 0.95667511 1366 andrew gelman stats-2012-06-05-How do segregation measures change when you change the level of aggregation?

11 0.94447029 398 andrew gelman stats-2010-11-06-Quote of the day

12 0.94117779 1487 andrew gelman stats-2012-09-08-Animated drought maps

same-blog 13 0.93873096 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

14 0.9269774 445 andrew gelman stats-2010-12-03-Getting a job in pro sports… as a statistician

15 0.91826892 1025 andrew gelman stats-2011-11-24-Always check your evidence

16 0.91746682 1598 andrew gelman stats-2012-11-30-A graphics talk with no visuals!

17 0.91169512 1745 andrew gelman stats-2013-03-02-Classification error

18 0.90162265 700 andrew gelman stats-2011-05-06-Suspicious pattern of too-strong replications of medical research

19 0.88834786 1156 andrew gelman stats-2012-02-06-Bayesian model-building by pure thought: Some principles and examples

20 0.8829127 1168 andrew gelman stats-2012-02-14-The tabloids strike again