andrew_gelman_stats-2010-14 knowledge-graph by maker-knowledge-mining

14 andrew gelman stats-2010-05-01-Imputing count data


meta information for this blog

Source: html

Introduction: Guy asks: I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address: How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds. My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
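The two-part decomposition described in the response (y = u*v) can be sketched as follows. This is an illustrative Python sketch of the split-and-recombine step only, not the mi package's actual implementation; in practice u would then be modeled with logistic regression and log(v) with a linear model, and the function names here are hypothetical.

```python
import math

def split_count(y):
    # u: indicator that the count is positive; log_v: log of the
    # positive part (None where the count is zero, since there is
    # nothing to model on the log scale there)
    u = [1 if yi > 0 else 0 for yi in y]
    log_v = [math.log(yi) if yi > 0 else None for yi in y]
    return u, log_v

def recombine(u, log_v):
    # multiply the parts back together and round to the nearest
    # integer to avoid fractional imputed counts
    return [round(ui * math.exp(lv)) if lv is not None else 0
            for ui, lv in zip(u, log_v)]

counts = [0, 3, 0, 150, 7]   # e.g. # of cows per household
u, log_v = split_count(counts)
# u is [0, 1, 0, 1, 1]; recombining the parts recovers the counts
assert recombine(u, log_v) == counts
```

The zeros are handled entirely by the binary indicator u, so the skewed positive part can be modeled on the log scale without the log-of-zero problem the questioner describes.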


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Guy asks: I am analyzing an original survey of farmers in Uganda. [sent-1, score-0.39]

2 I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. [sent-2, score-1.398]

3 I have a quick question which I hope you can find time to address: How do you recommend treating count data? [sent-3, score-0.465]

4 In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). [sent-5, score-0.98]

5 In the case of # of cows or chickens several obs have values in the hundreds. [sent-6, score-0.975]

6 My response: Here’s what we do in our mi package in R. [sent-7, score-0.244]

7 We split a variable into two parts: an indicator for whether it is positive, and the positive part. [sent-8, score-0.465]

8 Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. [sent-10, score-0.863]

9 At the end you can round to the nearest integer if you want to avoid fractional values. [sent-11, score-0.755]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('cows', 0.331), ('chickens', 0.319), ('welfare', 0.248), ('modeled', 0.234), ('log', 0.216), ('battery', 0.183), ('integer', 0.183), ('obs', 0.173), ('rooms', 0.173), ('farmers', 0.16), ('values', 0.152), ('fractional', 0.151), ('mi', 0.148), ('nearest', 0.148), ('positive', 0.144), ('proxy', 0.142), ('skewed', 0.142), ('treating', 0.137), ('variables', 0.134), ('problematic', 0.131), ('indicator', 0.127), ('round', 0.127), ('split', 0.119), ('index', 0.109), ('binary', 0.103), ('hoping', 0.103), ('count', 0.102), ('analyzing', 0.1), ('responses', 0.098), ('package', 0.096), ('dataset', 0.095), ('address', 0.095), ('parts', 0.093), ('asks', 0.089), ('create', 0.088), ('avoid', 0.084), ('recommend', 0.08), ('highly', 0.08), ('using', 0.076), ('natural', 0.076), ('quick', 0.075), ('variable', 0.075), ('guy', 0.074), ('zero', 0.073), ('hope', 0.071), ('single', 0.067), ('original', 0.066), ('taking', 0.066), ('survey', 0.064), ('end', 0.062)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 14 andrew gelman stats-2010-05-01-Imputing count data


2 0.16352396 1664 andrew gelman stats-2013-01-10-Recently in the sister blog: Brussels sprouts, ugly graphs, and switched at birth

Introduction: 1. Congress vs. Nickelback: The real action is in the cross tabs : Conservatives are mean, liberals are big babies, and, if supporting an STD is what it takes to be a political moderate, I don’t want to be one. 2. How 2012 stacks up: The worst graph on record? : OK, not actually worse than this one . 3. Boys will be boys; cows will be cows : Children’s essentialist reasoning about gender categories and animal species.

3 0.16058698 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance

Introduction: Steve Miller writes: Much of what I do is cross-national analyses of survey data (largely World Values Survey). . . . My big question pertains to (what I would call) exploratory analysis of multilevel data, especially when the group-level predictors are of theoretical importance. A lot of what I do involves analyzing cross-national survey items of citizen attitudes, typically of political leadership. These survey items are usually yes/no responses, or four-part responses indicating a level of agreement (strongly agree, agree, disagree, strongly disagree) that can be condensed into a binary variable. I believe these can be explained by reference to country-level factors. Much of the group-level variables of interest are count variables with a modal value of 0, which can be quite messy. How would you recommend exploring the variation in the dependent variable as it could be explained by the group-level count variable of interest, before fitting the multilevel model itself? When

4 0.12277327 2296 andrew gelman stats-2014-04-19-Index or indicator variables

Introduction: Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes: I’m exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally clear grasp on the material, but wanted to be sure I haven’t gone astray. The problem in working on involves a multi-nation survey of students, and I’m especially interested in understanding the effects of country, religion, and sex, and the interactions among those factors (using IRT to estimate individual-level ability, then estimating individual, school, and country effects). Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. Elsewhere in the book, you recommend against indicator variables in favor of a single index variable. Am I right in thinking t

5 0.092588842 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant

6 0.086800158 196 andrew gelman stats-2010-08-10-The U.S. as welfare state

7 0.0866201 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

8 0.085405029 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

9 0.081640624 25 andrew gelman stats-2010-05-10-Two great tastes that taste great together

10 0.080957025 1645 andrew gelman stats-2012-12-31-Statistical modeling, causal inference, and social science

11 0.080750018 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

12 0.074595176 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model

13 0.07344088 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents

14 0.071824625 1682 andrew gelman stats-2013-01-19-R package for Bayes factors

15 0.070805982 2176 andrew gelman stats-2014-01-19-Transformations for non-normal data

16 0.068867758 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?

17 0.067582369 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation

18 0.067078784 1228 andrew gelman stats-2012-03-25-Continuous variables in Bayesian networks

19 0.064061947 2290 andrew gelman stats-2014-04-14-On deck this week

20 0.063629277 1134 andrew gelman stats-2012-01-21-Lessons learned from a recent R package submission


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.095), (1, 0.041), (2, 0.061), (3, -0.029), (4, 0.077), (5, 0.029), (6, 0.004), (7, -0.028), (8, 0.045), (9, 0.023), (10, 0.015), (11, -0.013), (12, 0.011), (13, 0.001), (14, 0.019), (15, 0.009), (16, -0.004), (17, -0.001), (18, 0.021), (19, -0.02), (20, -0.003), (21, 0.037), (22, 0.013), (23, 0.001), (24, -0.005), (25, 0.011), (26, 0.038), (27, -0.027), (28, 0.025), (29, 0.001), (30, 0.026), (31, 0.047), (32, 0.028), (33, 0.038), (34, -0.014), (35, -0.045), (36, -0.001), (37, 0.05), (38, -0.031), (39, -0.0), (40, -0.013), (41, -0.013), (42, 0.041), (43, 0.005), (44, 0.002), (45, 0.032), (46, 0.018), (47, 0.025), (48, 0.015), (49, 0.035)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97625554 14 andrew gelman stats-2010-05-01-Imputing count data


2 0.83057922 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation


3 0.79808253 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance


4 0.78711241 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

Introduction: Fred Schiff writes: I’m writing to you to ask about the “R-squared” approximation procedure you suggest in your 2004 book with Dr. Hill. [See also this paper with Pardoe---ed.] I’m a media sociologist at the University of Houston. I’ve been using HLM3 for about two years. Briefly about my data. It’s a content analysis of news stories with a continuous scale dependent variable, story prominence. I have 6090 news stories, 114 newspapers, and 59 newspaper group owners. All the Level-1, Level-2 and dependent variables have been standardized. Since the means were zero anyway, we left the variables uncentered. All the Level-3 ownership groups and characteristics are dichotomous scales that were left uncentered. PROBLEM: The single most important result I am looking for is to compare the strength of nine competing Level-1 variables in their ability to predict and explain the outcome variable, story prominence. We are trying to use the residuals to calculate a “R-squ

5 0.76688868 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

Introduction: Aureliano Crameri writes: I have questions regarding one technique you and your colleagues described in your papers: the cross validation (Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box, with reference to Gelman, King, and Liu, 1998). I think this is the technique I need for my purpose, but I am not sure I understand it right. I want to use the multiple imputation to estimate the outcome of psychotherapies based on longitudinal data. First I have to demonstrate that I am able to get unbiased estimates with the multiple imputation. The expected bias is the overestimation of the outcome of dropouts. I will test my imputation strategies by means of a series of simulations (delete values, impute, compare with the original). Due to the complexity of the statistical analyses I think I need at least 200 cases. Now I don’t have so many cases without any missings. My data have missing values in different variables. The proportion of missing values is

6 0.75353146 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

7 0.72770506 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c

8 0.72709769 553 andrew gelman stats-2011-02-03-is it possible to “overstratify” when assigning a treatment in a randomized control trial?

9 0.72204238 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

10 0.71992445 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

11 0.71166271 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

12 0.70471287 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

13 0.69893515 2296 andrew gelman stats-2014-04-19-Index or indicator variables

14 0.69727802 1761 andrew gelman stats-2013-03-13-Lame Statistics Patents

15 0.69489789 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

16 0.6911369 2152 andrew gelman stats-2013-12-28-Using randomized incentives as an instrument for survey nonresponse?

17 0.67395073 251 andrew gelman stats-2010-09-02-Interactions of predictors in a causal model

18 0.66687429 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

19 0.66393632 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression

20 0.66035503 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.019), (9, 0.044), (11, 0.019), (16, 0.09), (21, 0.017), (24, 0.096), (37, 0.021), (53, 0.014), (56, 0.207), (64, 0.014), (82, 0.02), (89, 0.034), (94, 0.02), (98, 0.045), (99, 0.232)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93419528 14 andrew gelman stats-2010-05-01-Imputing count data


2 0.93217373 1045 andrew gelman stats-2011-12-07-Martyn Plummer’s Secret JAGS Blog

Introduction: Martyn Plummer , the creator of the open-source, C++, graphical-model compiler JAGS (aka “Just Another Gibbs Sampler”), runs a forum on the JAGS site that has a very similar feel to the mail-bag posts on this blog. Martyn answers general statistical computing questions (e.g., why slice sampling rather than Metropolis-Hastings?) and general modeling (e.g., why won’t my model converge with this prior?). Here’s the link to the top-level JAGS site, and to the forum: JAGS Forum JAGS Home Page The forum’s pretty active, with the stats page showing hundreds of views per day and very regular posts and answers. Martyn’s last post was today. Martyn also has a blog devoted to JAGS and other stats news: JAGS News Blog

3 0.91197848 1011 andrew gelman stats-2011-11-15-World record running times vs. distance

Introduction: Julyan Arbel plots world record running times vs. distance (on the log-log scale): The line has a slope of 1.1. I think it would be clearer to plot speed vs. distance—then you’d get a slope of -0.1, and the numbers would be more directly interpretable. Indeed, this paper by Sandra Savaglio and Vincenzo Carbone (referred to in the comments on Julyan’s blog) plots speed vs. time. Graphing by speed gives more resolution: The upper-left graph in the grid corresponds to the human running records plotted by Arbel. It’s funny that Arbel sees only one line whereas Savaglio and Carbone see two—but if you remove the 100m record at one end and the 100km at the other end, you can see two lines in Arbel’s graph as well. The bottom two graphs show swimming records. Knut would probably have something to say about all this.

4 0.89073789 1929 andrew gelman stats-2013-07-07-Stereotype threat!

Introduction: Colleen Ganley, Leigh Mingle, Allison Ryan, Katherine Ryan, Marian Vasilyeva, and Michelle Perry write : Stereotype threat has been proposed as 1 potential explanation for the gender difference in standardized mathematics test performance among high-performing students. At present, it is not entirely clear how susceptibility to stereotype threat develops, as empirical evidence for stereotype threat effects across the school years is inconsistent. In a series of 3 studies, with a total sample of 931 students, we investigated stereotype threat effects during childhood and adolescence. Three activation methods were used, ranging from implicit to explicit. Across studies, we found no evidence that the mathematics performance of school-age girls was impacted by stereotype threat. In 2 of the studies, there were gender differences on the mathematics assessment regardless of whether stereotype threat was activated. Potential reasons for these findings are discussed, including the possibil

5 0.88293815 933 andrew gelman stats-2011-09-30-More bad news: The (mis)reporting of statistical results in psychology journals

Introduction: Another entry in the growing literature on systematic flaws in the scientific research literature. This time the bad tidings come from Marjan Bakker and Jelte Wicherts, who write : Around 18% of statistical results in the psychological literature are incorrectly reported. Inconsistencies were more common in low-impact journals than in high-impact journals. Moreover, around 15% of the articles contained at least one statistical conclusion that proved, upon recalculation, to be incorrect; that is, recalculation rendered the previously significant result insignificant, or vice versa. These errors were often in line with researchers’ expectations. Their research also had a qualitative component: To obtain a better understanding of the origins of the errors made in the reporting of statistics, we contacted the authors of the articles with errors in the second study and asked them to send us the raw data. Regrettably, only 24% of the authors shared their data, despite our request

6 0.87714469 1158 andrew gelman stats-2012-02-07-The more likely it is to be X, the more likely it is to be Not X?

7 0.86201137 267 andrew gelman stats-2010-09-09-This Friday afternoon: Applied Statistics Center mini-conference on risk perception

8 0.86147809 1054 andrew gelman stats-2011-12-12-More frustrations trying to replicate an analysis published in a reputable journal

9 0.85301721 1388 andrew gelman stats-2012-06-22-Americans think economy isn’t so bad in their city but is crappy nationally and globally

10 0.84906816 780 andrew gelman stats-2011-06-27-Bridges between deterministic and probabilistic models for binary data

11 0.83013964 984 andrew gelman stats-2011-11-01-David MacKay sez . . . 12??

12 0.82401276 24 andrew gelman stats-2010-05-09-Special journal issue on statistical methods for the social sciences

13 0.79765713 534 andrew gelman stats-2011-01-24-Bayes at the end

14 0.79264724 2248 andrew gelman stats-2014-03-15-Problematic interpretations of confidence intervals

15 0.79255384 2134 andrew gelman stats-2013-12-14-Oswald evidence

16 0.79102492 2240 andrew gelman stats-2014-03-10-On deck this week: Things people sent me

17 0.78452051 630 andrew gelman stats-2011-03-27-What is an economic “conspiracy theory”?

18 0.78081352 426 andrew gelman stats-2010-11-22-Postdoc opportunity here at Columbia — deadline soon!

19 0.78003478 1162 andrew gelman stats-2012-02-11-Adding an error model to a deterministic model

20 0.77130795 2234 andrew gelman stats-2014-03-05-Plagiarism, Arizona style