andrew_gelman_stats-2012-1142 knowledge-graph by maker-knowledge-mining

1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation


meta info for this blog

Source: html

Introduction: John Hayes writes: I am a fan of the quarter root transform ever since reading about it on your blog. However, today my student and I hit a wall that I’m hoping you might have some insight on. By training, I am a psychophysicist (think SS Stevens), and people in my field often log transform data prior to analysis. However, this data frequently contains zeros, so I’ve tried using quarter root transforms to get around this. But until today, I had never tried to back transform the plot axis for readability. I assumed this would be straightforward – alas it is not. Specifically, we quarter root transformed our data, performed an ANOVA, got what we thought was a reasonable effect, and then plotted the data. So far so good. However, the LS means in question are below 1, meaning that raising them to the 4th power just makes them smaller, and uninterpretable in the original metric. Do you have any thoughts or insights you might share? My reply: I don’t see the problem with predicted values less than 1, but you can get negative values, which is kind of weird. I remember we had similar issues when using the square root power for missing-data imputation. For some examples (such as imputation of income) the square-root power worked well, but because of the negative values we couldn’t just throw it into our imputation program as a default. Another approach would be to think more seriously about what those zeros really imply, and perhaps use some latent-variable model.
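The arithmetic behind the complaint is easy to reproduce. The numpy sketch below (my own illustration with made-up data, not code from the post) shows the three facts in play: the quarter root is defined at zero where the log is not; back-transforming an individual value is exact, but back-transforming a mean is not, since x^4 is convex and Jensen’s inequality puts E[z]^4 below E[z^4] = E[y], which is why a transformed-scale mean below 1 appears to shrink; and a negative fitted value silently loses its sign.

```python
# A minimal sketch with made-up data (numpy only); not code from the post.
import numpy as np

rng = np.random.default_rng(0)

# Skewed data with exact zeros, as in the psychophysics example;
# log(0) is undefined, but 0 ** 0.25 == 0, so the quarter root works.
y = np.concatenate([np.zeros(20), rng.gamma(shape=0.5, scale=0.3, size=80)])
z = y ** 0.25

# Back-transforming an individual value is exact ...
print(np.allclose(z ** 4, y))             # True

# ... but back-transforming a *mean* is not. The mean on the transformed
# scale is below 1, so raising it to the 4th power makes it smaller, and
# it is not the mean of y anyway (Jensen: x ** 4 is convex, so
# E[z] ** 4 <= E[z ** 4] = E[y]).
z_bar = z.mean()
print(z_bar, z_bar ** 4, y.mean())

# A regression on the z scale can also produce negative fitted values;
# the 4th power silently discards the sign, which is the "kind of weird"
# behavior mentioned in the reply.
z_hat = -0.2                              # hypothetical fitted value
print(z_hat ** 4)                         # 0.0016, positive
```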


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 John Hayes writes: I am a fan of the quarter root transform ever since reading about it on your blog. [sent-1, score-1.129]

2 However, today my student and I hit a wall that I’m hoping you might have some insight on. [sent-2, score-0.444]

3 By training, I am a psychophysicist (think SS Stevens), and people in my field often log transform data prior to analysis. [sent-3, score-0.395]

4 However, this data frequently contains zeros, so I’ve tried using quarter root transforms to get around this. [sent-4, score-1.239]

5 But until today, I had never tried to back transform the plot axis for readability. [sent-5, score-0.595]

6 I assumed this would be straightforward – alas it is not. [sent-6, score-0.317]

7 Specifically, we quarter root transformed our data, performed an ANOVA, got what we thought was a reasonable effect, and then plotted the data. [sent-7, score-1.035]

8 However, the LS means in question are below 1, meaning that raising them to the 4th power just makes them smaller, and uninterpretable in the original metric. [sent-9, score-0.354]

9 Do you have any thoughts or insights you might share? [sent-10, score-0.085]

10 My reply: I don’t see the problem with predicted values less than 1, but you can get negative values, which is kind of weird. [sent-11, score-0.381]

11 I remember we had similar issues when using the square root power for missing-data imputation. [sent-12, score-0.821]

12 For some examples (such as imputation of income) the square-root power worked well, but because of the negative values we couldn’t just throw it into our imputation program as a default. [sent-13, score-1.041]

13 Another approach would be to think more seriously about what those zeros really imply, and perhaps use some latent-variable model. [sent-14, score-0.319]
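Sentence 13’s latent-variable suggestion can be made concrete in several ways. One common version, sketched below with scikit-learn on hypothetical simulated data, is a two-part (hurdle) model: a logistic regression for whether the response is positive at all, plus a log-scale regression for the positive part. This illustrates the general idea, not the particular model the reply has in mind.

```python
# A sketch of one latent-variable-style treatment of zeros: a two-part
# (hurdle) model. Data, predictors, and model choices are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                       # made-up predictors
latent = X @ np.array([1.0, -0.5])                  # latent propensity
positive = latent + rng.normal(size=200) > 0
y = np.where(positive,
             np.exp(0.5 * latent + rng.normal(scale=0.3, size=200)),
             0.0)                                   # zeros from a threshold

pos = y > 0

# Part 1: probability that the response is positive at all.
zero_part = LogisticRegression().fit(X, pos.astype(int))

# Part 2: log-linear model for the positive responses only.
pos_part = LinearRegression().fit(X[pos], np.log(y[pos]))

# Combine: E[y | x] = P(y > 0 | x) * E[y | y > 0, x], with the usual
# lognormal correction exp(sigma^2 / 2) for the mean on the raw scale.
p = zero_part.predict_proba(X)[:, 1]
s2 = np.var(np.log(y[pos]) - pos_part.predict(X[pos]))
print((p * np.exp(pos_part.predict(X) + s2 / 2))[:5])
```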


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('root', 0.424), ('quarter', 0.314), ('transform', 0.31), ('zeros', 0.251), ('imputation', 0.181), ('power', 0.179), ('values', 0.179), ('however', 0.146), ('ss', 0.144), ('stevens', 0.136), ('transforms', 0.136), ('alas', 0.13), ('hayes', 0.13), ('negative', 0.125), ('tried', 0.124), ('today', 0.113), ('plotted', 0.11), ('transformed', 0.106), ('straightforward', 0.105), ('anova', 0.103), ('square', 0.099), ('frequently', 0.098), ('axis', 0.092), ('raising', 0.091), ('insight', 0.086), ('wall', 0.085), ('insights', 0.085), ('log', 0.085), ('meaning', 0.084), ('contains', 0.083), ('imply', 0.082), ('assumed', 0.082), ('fan', 0.081), ('performed', 0.081), ('hoping', 0.081), ('training', 0.08), ('hit', 0.079), ('predicted', 0.077), ('throw', 0.075), ('specifically', 0.074), ('smaller', 0.074), ('plot', 0.069), ('seriously', 0.068), ('couldn', 0.065), ('share', 0.065), ('income', 0.064), ('program', 0.062), ('using', 0.06), ('remember', 0.059), ('worked', 0.059)]
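For readers wondering how the word weights above and the similarity lists below are produced: the standard recipe, assumed here since the page does not document its pipeline, is to tfidf-vectorize every post and rank neighbors by cosine similarity, which is also why each “same-blog” row scores essentially 1.0 against itself. A minimal scikit-learn sketch with stand-in documents:

```python
# A sketch of the assumed recipe: tfidf-vectorize, then rank by cosine
# similarity. The documents here are tiny stand-ins, not the real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "quarter root transform zeros anova power imputation",   # this post
    "survey weighting square root fractional power",
    "missing data imputation cross validation",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)

# Top-weighted words for document 0, mirroring the (word, tfidf) list above.
weights = tfidf[0].toarray().ravel()
words = vec.get_feature_names_out()
print(sorted(zip(words, weights.round(3)), key=lambda p: -p[1])[:5])

# Document 0 against every document; it scores ~1.0 against itself,
# like the "same-blog" rows below.
print(cosine_similarity(tfidf[0], tfidf).ravel())
```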

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation


2 0.12366356 705 andrew gelman stats-2011-05-10-Some interesting unpublished ideas on survey weighting

Introduction: A couple years ago we had an amazing all-star session at the Joint Statistical Meetings. The topic was new approaches to survey weighting (which is a mess, as I’m sure you’ve heard). Xiao-Li Meng recommended shrinking weights by taking them to a fractional power (such as square root) instead of trimming the extremes. Rod Little combined design-based and model-based survey inference. Michael Elliott used mixture models for complex survey design. And here’s my introduction to the session.

3 0.12355798 1141 andrew gelman stats-2012-01-28-Using predator-prey models on the Canadian lynx series

Introduction: The “Canadian lynx data” is one of the famous examples used in time series analysis. And the usual models that are fit to these data in the statistics time-series literature don’t work well. Cavan Reilly and Angelique Zeringue write: Reilly and Zeringue then present their analysis. Their simple little predator-prey model with a weakly informative prior way outperforms the standard big-ass autoregression models. Check this out: Or, to put it into numbers, when they fit their model to the first 80 years and predict to the next 34, their root mean square out-of-sample error is 1480 (see scale of data above). In contrast, the standard model fit to these data (the SETAR model of Tong, 1990) has more than twice as many parameters but gets a worse-performing root mean square error of 1600, even when that model is fit to the entire dataset. (If you fit the SETAR or any similar autoregressive model to the first 80 years and use it to predict the next 34, the predictions

4 0.11714494 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

Introduction: A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold. Here’s my reply: As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components. The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes. I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an inform

5 0.10464972 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

Introduction: Elena Grewal writes: I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables. My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant

6 0.10402176 608 andrew gelman stats-2011-03-12-Single or multiple imputation?

7 0.090792432 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

8 0.088795811 1406 andrew gelman stats-2012-07-05-Xiao-Li Meng and Xianchao Xie rethink asymptotics

9 0.088520341 923 andrew gelman stats-2011-09-24-What is the normal range of values in a medical test?

10 0.078101836 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

11 0.074357405 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

12 0.073949836 2176 andrew gelman stats-2014-01-19-Transformations for non-normal data

13 0.069516063 2072 andrew gelman stats-2013-10-21-The future (and past) of statistical sciences

14 0.069327876 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters

15 0.06885799 1786 andrew gelman stats-2013-04-03-Hierarchical array priors for ANOVA decompositions

16 0.068734407 1672 andrew gelman stats-2013-01-14-How do you think about the values in a confidence interval?

17 0.067582369 14 andrew gelman stats-2010-05-01-Imputing count data

18 0.067516565 1944 andrew gelman stats-2013-07-18-You’ll get a high Type S error rate if you use classical statistical methods to analyze data from underpowered studies

19 0.067333177 1605 andrew gelman stats-2012-12-04-Write This Book

20 0.066429518 275 andrew gelman stats-2010-09-14-Data visualization at the American Evaluation Association


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.136), (1, 0.029), (2, 0.026), (3, 0.003), (4, 0.044), (5, -0.01), (6, 0.034), (7, -0.002), (8, 0.013), (9, 0.033), (10, -0.004), (11, 0.007), (12, 0.013), (13, -0.024), (14, 0.012), (15, -0.004), (16, -0.002), (17, -0.01), (18, 0.008), (19, -0.012), (20, 0.021), (21, 0.02), (22, -0.015), (23, -0.026), (24, -0.004), (25, -0.002), (26, -0.013), (27, -0.026), (28, 0.019), (29, -0.027), (30, 0.018), (31, 0.026), (32, 0.018), (33, 0.018), (34, -0.007), (35, 0.001), (36, 0.037), (37, 0.023), (38, 0.011), (39, 0.011), (40, -0.01), (41, 0.013), (42, 0.008), (43, 0.016), (44, -0.012), (45, -0.011), (46, -0.002), (47, -0.014), (48, 0.038), (49, 0.013)]
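Read under the standard interpretation of lsi, these (topicId, topicWeight) pairs are the post’s coordinates after a truncated SVD of the tfidf matrix; ids 0 through 49 suggest a 50-topic space. A small sketch of that step (stand-in corpus, and only 2 components because the toy vocabulary is tiny):

```python
# A sketch of the lsi step under its standard interpretation: truncated
# SVD of the tfidf matrix. The real page evidently uses 50 topics
# (ids 0-49); the toy vocabulary here only supports 2 components.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "quarter root transform zeros anova power imputation",   # stand-ins
    "survey weighting square root fractional power",
    "missing data imputation cross validation",
]
tfidf = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
lsi = svd.fit_transform(tfidf)

# This document's topic weights, analogous to the [(0, 0.136), ...] list.
print(list(enumerate(lsi[0].round(3))))

# Similarity in the latent space, analogous to the simValue column.
print(cosine_similarity(lsi[:1], lsi).ravel())
```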

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94213742 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation


2 0.80166435 257 andrew gelman stats-2010-09-04-Question about standard range for social science correlations

Introduction: Andrew Eppig writes: I’m a physicist by training who is transitioning to the social sciences. I recently came across a reference in the Economist to a paper on IQ and parasites which I read as I have more than a passing interest in IQ research (having read much that you and others (e.g., Shalizi, Wicherts) have written). In this paper I note that the authors find a very high correlation between national IQ and parasite prevalence. The strength of the correlation (-0.76 to -0.82) surprised me, as I’m used to much weaker correlations in the social sciences. To me, it’s a bit too high, suggesting that there are other factors at play or that one of the variables is merely a proxy for a large number of other variables. But I have no basis for this other than a gut feeling and a memory of a plot on Language Log about the distribution of correlation coefficients in social psychology. So my question is this: Is a correlation in the range of (-0.82,-0.76) more likely to be a correlatio

3 0.7945503 1918 andrew gelman stats-2013-06-29-Going negative

Introduction: Troels Ring writes: I have measured total phosphorus, TP, on a number of dialysis patients, and also measured conventional phosphate, Pi. Now P is exchanged with the environment as Pi, so in principle a correlation between TP and Pi could perhaps be expected. I’m really most interested in the fraction of TP which is not Pi, that is TP-Pi. I would also expect that to be positively correlated with Pi. However, looking at the data using a mixed model an insignificant negative correlation is obtained. Then I thought, that since TP-Pi is bound to be small if Pi is large a negative correlation is almost dictated by the math even if the biology would have it otherwise in so far as the the TP-Pi, likely organic P, must someday have been Pi. Hence I thought about correcting the slight negative correlation between TP-Pi and Pi for the expected large negative correlation due to the math – to eventually recover what I came from: a positive correlation. People seems to agree that this thinki

4 0.78658563 938 andrew gelman stats-2011-10-03-Comparing prediction errors

Introduction: Someone named James writes: I’m working on a classification task, sentence segmentation. The classifier algorithm we use (BoosTexter, a boosted learning algorithm) classifies each word independently conditional on its features, i.e. a bag-of-words model, so any contextual clues need to be encoded into the features. The feature extraction system I am proposing in my thesis uses a heteroscedastic LDA to transform data to produce the features the classifier runs on. The HLDA system has a couple parameters I’m testing, and I’m running a 3×2 full factorial experiment. That’s the background which may or may not be relevant to the question. The output of each trial is a class (there are only 2 classes, right now) for every word in the dataset. Because of the nature of the task, one class strongly predominates, say 90-95% of the data. My question is this: in terms of overall performance (we use F1 score), many of these trials are pretty close together, which leads me to ask whethe

5 0.78437227 833 andrew gelman stats-2011-07-31-Untunable Metropolis

Introduction: Michael Margolis writes: What are we to make of it when a Metropolis-Hastings step just won’t tune? That is, the acceptance rate is zero at expected-jump-size X, and way above 1/2 at X-exp(-16) (i.e., machine precision ). I’ve solved my practical problem by writing that I would have liked to include results from a diffuse prior, but couldn’t. But I’m bothered by the poverty of my intuition. And since everything I’ve read says this is an issue of efficiency, rather than accuracy, I wonder if I could solve it just by running massive and heavily thinned chains. My reply: I can’t see how this could happen in a well-specified problem! I suspect it’s a bug. Otherwise try rescaling your variables so that your parameters will have values on the order of magnitude of 1. To which Margolis responded: I hardly wrote any of the code, so I can’t speak to the bug question — it’s binomial kriging from the R package geoRglm. And there are no covariates to scale — just the zero and one

6 0.77596349 1462 andrew gelman stats-2012-08-18-Standardizing regression inputs

7 0.76262206 1121 andrew gelman stats-2012-01-15-R-squared for multilevel models

8 0.75348055 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

9 0.75335383 527 andrew gelman stats-2011-01-20-Cars vs. trucks

10 0.75181693 888 andrew gelman stats-2011-09-03-A psychology researcher asks: Is Anova dead?

11 0.74572533 1070 andrew gelman stats-2011-12-19-The scope for snooping

12 0.74032563 923 andrew gelman stats-2011-09-24-What is the normal range of values in a medical test?

13 0.7398085 561 andrew gelman stats-2011-02-06-Poverty, educational performance – and what can be done about it

14 0.7378906 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

15 0.73536092 1703 andrew gelman stats-2013-02-02-Interaction-based feature selection and classification for high-dimensional biological data

16 0.73416358 327 andrew gelman stats-2010-10-07-There are never 70 distinct parameters

17 0.73278284 1330 andrew gelman stats-2012-05-19-Cross-validation to check missing-data imputation

18 0.73088348 1094 andrew gelman stats-2011-12-31-Using factor analysis or principal components analysis or measurement-error models for biological measurements in archaeology?

19 0.73081356 1346 andrew gelman stats-2012-05-27-Average predictive comparisons when changing a pair of variables

20 0.72961998 650 andrew gelman stats-2011-04-05-Monitor the efficiency of your Markov chain sampler using expected squared jumped distance!


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(5, 0.011), (9, 0.264), (16, 0.081), (17, 0.014), (21, 0.045), (23, 0.011), (24, 0.109), (30, 0.015), (45, 0.011), (55, 0.017), (64, 0.011), (79, 0.013), (86, 0.025), (99, 0.274)]
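The lda weights are read the same way: each post gets a mixture over topics, and the list keeps only topics with non-negligible weight, which is why it is much sparser than the dense lsi vector. A sketch of how such a mixture might be produced (assumed pipeline, stand-in corpus):

```python
# A sketch of the assumed lda step: fit a topic model on raw term counts
# and read off each document's topic mixture, keeping only topics with
# appreciable weight. Corpus and topic count are stand-ins.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "quarter root transform zeros anova power imputation",
    "survey weighting square root fractional power",
    "missing data imputation cross validation",
]
counts = CountVectorizer().fit_transform(docs)      # LDA wants counts

lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(counts)                   # each row sums to 1

# Sparse (topicId, topicWeight) pairs, like the list above.
print([(k, round(w, 3)) for k, w in enumerate(theta[0]) if w > 0.05])
```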

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98456991 577 andrew gelman stats-2011-02-16-Annals of really really stupid spam

Introduction: This came in the inbox today: Dear Dr. Gelman, GenWay recently found your article titled “Multiple imputation for model checking: completed-data plots with missing and latent data.” (Biometrics. 2005 Mar;61(1):74-85.) and thought you might be interested in learning about our superior quality signaling proteins. GenWay prides itself on being a leader in customer service aiming to exceed your expectations with the quality and price of our products. With more than 60,000 reagents backed by our outstanding guarantee you are sure to find the products you have been searching for. Please feel free to visit the following resource pages: * Apoptosis Pathway (product list) * Adipocytokine (product list) * Cell Cycle Pathway (product list) * Jak STAT (product list) * GnRH (product list) * MAPK (product list) * mTOR (product list) * T Cell Receptor (product list) * TGF-beta (product list) * Wnt (product list) * View All Pathways

2 0.9791187 993 andrew gelman stats-2011-11-05-The sort of thing that gives technocratic reasoning a bad name

Introduction: 1. Freakonomics characterizes drunk driving as an example of “the human tendency to worry about rare problems that are unlikely to happen.” 2. The CDC reports , “Alcohol-impaired drivers are involved in about 1 in 3 crash deaths, resulting in nearly 11,000 deaths in 2009.” No offense to the tenured faculty at the University of Chicago, but I’m going with the CDC on this one. P.S. The Freakonomics blog deserves to be dinged another time, not just for claiming, based on implausible assumptions and making the all-else-equal fallacy that “drunk walking is 8 times more likely to result in your death than drunk driving” but for presenting this weak inference as a fact rather than as a speculation. When doing “Freakonomics,” you can be counterintuitive, or you can be sensible, but it’s hard to be both. I mean, sure, sometimes you can be. But there’s a tradeoff, and in this case, they’re choosing to push the envelope on counterintuitiveness.

3 0.96984756 529 andrew gelman stats-2011-01-21-“City Opens Inquiry on Grading Practices at a Top-Scoring Bronx School”

Introduction: Sharon Otterman reports : When report card grades were released in the fall for the city’s 455 high schools, the highest score went to a small school in a down-and-out section of the Bronx . . . A stunning 94 percent of its seniors graduated, more than 30 points above the citywide average. . . . “When I interviewed for the school,” said Sam Buchbinder, a history teacher, “it was made very clear: this is a school that doesn’t believe in anyone failing.” That statement was not just an exhortation to excellence. It was school policy. By order of the principal, codified in the school’s teacher handbook, all teachers should grade their classes in the same way: 30 percent of students should earn a grade in the A range, 40 percent B’s, 25 percent C’s, and no more than 5 percent D’s. As long as they show up, they should not fail. Hey, that sounds like Harvard and Columbia^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H various selective northeastern colleges I’ve known. Of course, we^H^H

4 0.96923423 1356 andrew gelman stats-2012-05-31-Question 21 of my final exam for Design and Analysis of Sample Surveys

Introduction: 21. A country is divided into three regions with populations of 2 million, 2 million, and 0.5 million, respectively. A survey is done asking about foreign policy opinions. Somebody proposes taking a sample of 50 people from each region. Give a reason why this non-proportional sample would not usually be done, and also a reason why it might actually be a good idea. Solution to question 20. From yesterday: 20. Explain in two sentences why we expect survey respondents to be honest about vote preferences but possibly dishonest about reporting unhealthy behaviors. Solution: Respondents tend to be sincere about vote preferences because this affects the outcome of the poll, and people are motivated to have their candidate poll well. This motivation is typically not present in reporting behaviors; you have no particular reason for wanting to affect the average survey response.

5 0.96788543 1291 andrew gelman stats-2012-04-30-Systematic review of publication bias in studies on publication bias

Introduction: Via Yalda Afshar , a 2005 paper by Hans-Hermann Dubben and Hans-Peter Beck-Bornholdt: Publication bias is a well known phenomenon in clinical literature, in which positive results have a better chance of being published, are published earlier, and are published in journals with higher impact factors. Conclusions exclusively based on published studies, therefore, can be misleading. Selective under-reporting of research might be more widespread and more likely to have adverse consequences for patients than publication of deliberately falsified data. We investigated whether there is preferential publication of positive papers on publication bias. They conclude, “We found no evidence of publication bias in reports on publication bias.” But of course that’s the sort of finding regarding publication bias of findings on publication bias that you’d expect would get published. What we really need is a careful meta-analysis to estimate the level of publication bias in studies of publi

6 0.96295726 1532 andrew gelman stats-2012-10-13-A real-life dollar auction game!

7 0.95634222 1424 andrew gelman stats-2012-07-22-Extreme events as evidence for differences in distributions

8 0.95036137 1664 andrew gelman stats-2013-01-10-Recently in the sister blog: Brussels sprouts, ugly graphs, and switched at birth

9 0.94385684 1332 andrew gelman stats-2012-05-20-Problemen met het boek

10 0.93180138 1961 andrew gelman stats-2013-07-29-Postdocs in probabilistic modeling! With David Blei! And Stan!

11 0.92347443 1565 andrew gelman stats-2012-11-06-Why it can be rational to vote

12 0.92347437 389 andrew gelman stats-2010-11-01-Why it can be rational to vote

13 0.92038041 29 andrew gelman stats-2010-05-12-Probability of successive wins in baseball

same-blog 14 0.9198606 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation

15 0.91398263 1110 andrew gelman stats-2012-01-10-Jobs in statistics research! In New Jersey!

16 0.90865386 560 andrew gelman stats-2011-02-06-Education and Poverty

17 0.90755928 1226 andrew gelman stats-2012-03-22-Story time meets the all-else-equal fallacy and the fallacy of measurement

18 0.89264441 1566 andrew gelman stats-2012-11-07-A question about voting systems—unrelated to U.S. elections!

19 0.88680172 1715 andrew gelman stats-2013-02-09-Thomas Hobbes would be spinning in his grave

20 0.88237476 640 andrew gelman stats-2011-03-31-Why Edit Wikipedia?