andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-2046 knowledge-graph by maker-knowledge-mining

2046 andrew gelman stats-2013-10-01-I’ll say it again


meta info for this blog

Source: html

Introduction: Milan Valasek writes: Psychology students (and probably students in other disciplines) are often taught that in order to perform ‘parametric’ tests, e.g. an independent t-test, the data for each group need to be normally distributed. However, in the literature (and in various university lecture notes and slides accessible online), I have come across at least four different interpretations of what it is that is supposed to be normally distributed when doing a t-test: 1. the population; 2. the sampled data for each group; 3. the distribution of estimates of the mean for each group; 4. the distribution of estimates of the difference between groups. I can see how 2 would follow from 1, and 4 from 3, but even then there are two different sets of interpretations of the normality assumption. Could you please put this issue to rest for me? My quick response is that normality is not so important unless you are focusing on prediction.
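The distinction between interpretations 2 and 4 can be illustrated with a quick simulation (a sketch added here, not from the original post): even when the raw data in each group are strongly skewed, the sampling distribution of the difference in means is close to normal at moderate sample sizes, by the central limit theorem.

```python
import random
import statistics

def diff_in_means_draws(n_per_group=50, n_sims=2000, seed=1):
    """Repeatedly draw two groups from a skewed population (exponential,
    mean 1) and record the difference in sample means each time."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        a = [rng.expovariate(1.0) for _ in range(n_per_group)]
        b = [rng.expovariate(1.0) for _ in range(n_per_group)]
        diffs.append(statistics.mean(a) - statistics.mean(b))
    return diffs

diffs = diff_in_means_draws()
# The diffs come out roughly normal, centered at 0 with sd about
# sqrt(2/50) ~= 0.2, even though each group's raw data are far from normal.
print(statistics.mean(diffs), statistics.stdev(diffs))
```

So a histogram of `diffs` looks bell-shaped despite the exponential raw data, which is why interpretation 2 (normal data in each group) is much stronger than what the t-test's reference distribution actually needs at these sample sizes.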


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Milan Valasek writes: Psychology students (and probably students in other disciplines) are often taught that in order to perform ‘parametric’ tests, e. [sent-1, score-0.788]

2 independent t-test, the data for each group need to be normally distributed. [sent-3, score-0.742]

3 However, in literature (and various university lecture notes and slides accessible online), I have come across at least 4 different interpretation of what it is that is supposed to be normally distributed when doing a t-test: 1. [sent-4, score-1.804]

4 distribution of estimates of means for each group 4. [sent-7, score-0.662]

5 distribution of estimates of the difference between groups I can see how 2 would follow from 1 and 4 from 3 but even then, there are two different sets of interpretations of the normality assumption. [sent-8, score-1.369]

6 My quick response is that normality is not so important unless you are focusing on prediction. [sent-10, score-0.831]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('normality', 0.409), ('normally', 0.31), ('group', 0.255), ('milan', 0.235), ('parametric', 0.173), ('disciplines', 0.173), ('distribution', 0.162), ('estimates', 0.161), ('sampled', 0.157), ('students', 0.156), ('lecture', 0.155), ('interpretations', 0.154), ('accessible', 0.151), ('distributed', 0.148), ('slides', 0.14), ('focusing', 0.135), ('taught', 0.127), ('notes', 0.122), ('perform', 0.122), ('sets', 0.115), ('supposed', 0.114), ('independent', 0.113), ('prediction', 0.113), ('unless', 0.112), ('interpretation', 0.112), ('rest', 0.109), ('online', 0.104), ('tests', 0.103), ('groups', 0.102), ('quick', 0.096), ('please', 0.096), ('different', 0.093), ('follow', 0.092), ('order', 0.091), ('psychology', 0.091), ('population', 0.088), ('literature', 0.087), ('means', 0.084), ('university', 0.084), ('across', 0.083), ('difference', 0.081), ('response', 0.079), ('however', 0.079), ('issue', 0.078), ('probably', 0.074), ('various', 0.073), ('come', 0.068), ('least', 0.064), ('data', 0.064), ('often', 0.062)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 2046 andrew gelman stats-2013-10-01-I’ll say it again


2 0.28454214 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?

Introduction: Andy Cooper writes: A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book wit

3 0.18047208 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions

Introduction: Jeff Witmer writes: I noticed that you continue the standard practice in statistics of referring to assumptions; e.g. a blog entry on 2/4/11 at 10:54: “Our method, just like any model, relies on assumptions which we have the duty to state and to check.” I’m in the 6th year of a three-year campaign to get statisticians to drop the word “assumptions” and replace it with “conditions.” The problem, as I see it, is that people tend to think that an assumption is something that one assumes, as in “assuming that we have a right triangle…” or “assuming that k is even…” when constructing a mathematical proof. But in statistics we don’t assume things — unless we have to. Instead, we know that, for example, the validity of a t-test depends on normality, which is a condition that can and should be checked. Let’s not call normality an assumption, lest we imply that it is something that can be assumed. Let’s call it a condition. What do you all think?

4 0.15452147 2176 andrew gelman stats-2014-01-19-Transformations for non-normal data

Introduction: Steve Peterson writes: I recently submitted a proposal on applying a Bayesian analysis to gender comparisons on motivational constructs. I had an idea on how to improve the model I used and was hoping you could give me some feedback. The data come from a survey based on 5-point Likert scales. Different constructs are measured for each student as scores derived from averaging a student’s responses on particular subsets of survey questions. (I suppose it is not uncontroversial to treat these scores as interval measures and would be interested to hear if you have any objections.) I am comparing genders on each construct. Researchers typically use t-tests to do so. To use a Bayesian approach I applied the programs written in R and JAGS by John Kruschke for estimating the difference of means: http://www.indiana.edu/~kruschke/BEST/ An issue in that analysis is that the distributions of student scores are not normal. There was skewness in some of the distributions and not always in

5 0.12900609 1352 andrew gelman stats-2012-05-29-Question 19 of my final exam for Design and Analysis of Sample Surveys

Introduction: 19. A survey is taken of students in a metropolitan area. At the first stage a school is sampled at random. The schools are divided into two strata: 20 private schools and 50 public schools are sampled. At the second stage, 5 classes are sampled within each sampled school. At the third stage, 10 students are sampled within each class. What is the probability that any given student is sampled? Express this in terms of the number of students in the class, number of classes in the school, and number of schools in the area. Define appropriate notation as needed. Solution to question 18 From yesterday : 18. A survey is taken of 100 undergraduates, 100 graduate students, and 100 continuing education students at a university. Assume a simple random sample within each group. Each student is asked to rate his or her satisfaction (on a 1–10 scale) with his or her experiences. Write the estimate and standard error of the average satisfaction of all the students at the university. Introd
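The inclusion probability the exam question asks for is just the product of the selection probabilities at the three stages. A minimal sketch (the school, class, and student totals below are made-up illustration values, not part of the exam question):

```python
def inclusion_prob(schools_sampled, schools_total,
                   classes_sampled, classes_in_school,
                   students_sampled, students_in_class):
    """P(a given student is sampled) in a three-stage design: the
    student's school is drawn from its stratum, then one of the school's
    classes, then the student within the class."""
    return (schools_sampled / schools_total
            * classes_sampled / classes_in_school
            * students_sampled / students_in_class)

# Hypothetical numbers: 20 of 100 schools in the stratum are sampled,
# 5 of 10 classes per school, 10 of 25 students per class.
p = inclusion_prob(20, 100, 5, 10, 10, 25)
print(p)  # 0.2 * 0.5 * 0.4, i.e. about 0.04
```

Note the answer depends on the stratum the student's school belongs to, since private and public schools are sampled at different rates.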

6 0.11631264 39 andrew gelman stats-2010-05-18-The 1.6 rule

7 0.11101902 1353 andrew gelman stats-2012-05-30-Question 20 of my final exam for Design and Analysis of Sample Surveys

8 0.10801455 1414 andrew gelman stats-2012-07-12-Steven Pinker’s unconvincing debunking of group selection

9 0.10526753 2066 andrew gelman stats-2013-10-17-G+ hangout for test run of BDA course

10 0.10473156 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model

11 0.095644593 2099 andrew gelman stats-2013-11-13-“What are some situations in which the classical approach (or a naive implementation of it, based on cookbook recipes) gives worse results than a Bayesian approach, results that actually impeded the science?”

12 0.095388591 2008 andrew gelman stats-2013-09-04-Does it matter that a sample is unrepresentative? It depends on the size of the treatment interactions

13 0.095245153 1267 andrew gelman stats-2012-04-17-Hierarchical-multilevel modeling with “big data”

14 0.095071077 1965 andrew gelman stats-2013-08-02-My course this fall on l’analyse bayésienne de données

15 0.094656371 972 andrew gelman stats-2011-10-25-How do you interpret standard errors from a regression fit to the entire population?

16 0.092717662 2041 andrew gelman stats-2013-09-27-Setting up Jitts online

17 0.090014927 326 andrew gelman stats-2010-10-07-Peer pressure, selection, and educational reform

18 0.088620923 1752 andrew gelman stats-2013-03-06-Online Education and Jazz

19 0.087462194 1517 andrew gelman stats-2012-10-01-“On Inspiring Students and Being Human”

20 0.086822711 516 andrew gelman stats-2011-01-14-A new idea for a science core course based entirely on computer simulation


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.143), (1, 0.037), (2, 0.023), (3, -0.044), (4, 0.052), (5, 0.085), (6, 0.015), (7, 0.064), (8, -0.057), (9, 0.008), (10, 0.032), (11, 0.041), (12, 0.004), (13, -0.028), (14, -0.002), (15, -0.008), (16, -0.034), (17, -0.019), (18, -0.013), (19, 0.014), (20, 0.032), (21, 0.014), (22, 0.015), (23, -0.077), (24, 0.009), (25, 0.001), (26, 0.023), (27, 0.004), (28, 0.033), (29, 0.038), (30, 0.009), (31, -0.003), (32, -0.007), (33, 0.021), (34, 0.027), (35, 0.015), (36, -0.023), (37, -0.021), (38, 0.003), (39, 0.012), (40, 0.061), (41, -0.052), (42, 0.048), (43, -0.032), (44, -0.005), (45, 0.026), (46, 0.024), (47, 0.027), (48, -0.007), (49, 0.001)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96319431 2046 andrew gelman stats-2013-10-01-I’ll say it again


2 0.76674205 2041 andrew gelman stats-2013-09-27-Setting up Jitts online

Introduction: I use just-in-time teaching assignments in all my classes now. Vince helpfully sent along these instructions for setting these up on Google. See below. I think Jitts are just wonderful, and they’re so easy to set up, you should definitely be doing them in your classes too. I’ve had more difficulty with Peer Instruction (the companion tool to just-in-time teaching) as it requires questions at just the right level for the class. I do have students frequently work in pairs, though, so I think I get some of the benefit of that. P.S. I’d love to share all the Jitts with you for Bayesian Data Analysis, but I’m afraid this would poison the well and future students would not have the opportunity to be surprised by them. Yes, I know, I should just come up with new ones every year—but I’m not quite ready to do that! Perhaps soon I will. In the meantime, a commenter asked for some Jitts, so here are the ones for the first and last weeks of class: Jitt questions for Bayesian Data

3 0.70665854 277 andrew gelman stats-2010-09-14-In an introductory course, when does learning occur?

Introduction: Now that September has arrived, it’s time for us to think teaching. Here’s something from Andrew Heckler and Eleanor Sayre. Heckler writes: The article describes a project studying the performance of university level students taking an intro physics course. Every week for ten weeks we took 1/10th of the students (randomly selected only once) and gave them the same set of questions relevant to the course. This allowed us to plot the evolution of average performance in the class during the quarter. We can then determine when learning occurs: For example, do they learn the material in a relevant lecture or lab or homework? Since we had about 350 students taking the course, we could get some reasonable stats. In particular, you might be interested in Figure 10 (page 774) which shows student performance day-by-day on a particular question. The performance does not change directly after lecture, but rather only when the homework was due. [emphasis added] We could not find any oth

4 0.70021325 402 andrew gelman stats-2010-11-09-Kaggle: forecasting competitions in the classroom

Introduction: Anthony Goldbloom writes: For those who haven’t come across Kaggle, we are a new platform for data prediction competitions. Companies and researchers put up a dataset and a problem and data scientists compete to produce the best solutions. We’ve just launched a new initiative called Kaggle in Class, allowing instructors to host competitions for their students. Competitions are a neat way to engage students, giving them the opportunity to put into practice what they learn. The platform offers live leaderboards, so students get instant feedback on the accuracy of their work. And since competitions are judged on objective criteria (predictions are compared with outcomes), the platform offers unique assessment opportunities. The first Kaggle in Class competition is being hosted by Stanford University’s Stats 202 class and requires students to predict the price of different wines based on vintage, country, ratings and other information. Those interested in hosting a competition f

5 0.6722827 2083 andrew gelman stats-2013-10-31-Value-added modeling in education: Gaming the system by sending kids on a field trip at test time

Introduction: Just in time for Halloween, here’s a horror story for you . . . Howard Wainer writes: In my book “Uneducated Guesses” in the chapter on value-added models, I discuss how the treatment of missing data can have a profound effect on the estimates of teacher scores. I made up how a principal might send the best students on a field trip at the beginning of the year when the ‘pre-test’ was given (and their scores would be imputed from the students who showed up) and that the bottom half of the class would have a matching field trip on the day of the post test. Everyone laughed. But apparently someone decided to take it seriously. http://www.amren.com/news/2012/10/el-paso-schools-confront-scandal-of-students-who-disappeared-at-test-time/ http://www.elpasotimes.com/episd/ci_20848628/former-episd-superintendent-lorenzo-garcia-enter-plea-aggreement You can’t make this stuff up. This sort of thing is not surprising but it’s worth keeping in mind. That a measurement system c

6 0.67155719 956 andrew gelman stats-2011-10-13-Hey, you! Don’t take that class!

7 0.66510946 326 andrew gelman stats-2010-10-07-Peer pressure, selection, and educational reform

8 0.66412944 1517 andrew gelman stats-2012-10-01-“On Inspiring Students and Being Human”

9 0.65007722 1657 andrew gelman stats-2013-01-06-Lee Nguyen Tran Kim Song Shimazaki

10 0.64628857 957 andrew gelman stats-2011-10-14-Questions about a study of charter schools

11 0.63593286 938 andrew gelman stats-2011-10-03-Comparing prediction errors

12 0.62592351 1864 andrew gelman stats-2013-05-20-Evaluating Columbia University’s Frontiers of Science course

13 0.62363011 1943 andrew gelman stats-2013-07-18-Data to use for in-class sampling exercises?

14 0.62275159 606 andrew gelman stats-2011-03-10-It’s no fun being graded on a curve

15 0.616328 1008 andrew gelman stats-2011-11-13-Student project competition

16 0.61174417 226 andrew gelman stats-2010-08-23-More on those L.A. Times estimates of teacher effectiveness

17 0.61159158 213 andrew gelman stats-2010-08-17-Matching at two levels

18 0.61017722 2128 andrew gelman stats-2013-12-09-How to model distributions that have outliers in one direction

19 0.60482353 315 andrew gelman stats-2010-10-03-He doesn’t trust the fit . . . r=.999

20 0.60424101 1722 andrew gelman stats-2013-02-14-Statistics for firefighters: update


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(16, 0.045), (18, 0.17), (24, 0.206), (42, 0.042), (84, 0.062), (99, 0.351)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97791773 2046 andrew gelman stats-2013-10-01-I’ll say it again


2 0.97033715 969 andrew gelman stats-2011-10-22-Researching the cost-effectiveness of political lobbying organisations

Introduction: Sally Murray from Giving What We Can writes: We are an organisation that assesses different charitable (/fundable) interventions, to estimate which are the most cost-effective (measured in terms of the improvement of life for people in developing countries gained for every dollar invested). Our research guides and encourages greater donations to the most cost-effective charities we thus identify, and our members have so far pledged a total of $14m to these causes, with many hundreds more relying on our advice in a less formal way. I am specifically researching the cost-effectiveness of political lobbying organisations. We are initially focusing on organisations that lobby for ‘big win’ outcomes such as increased funding of the most cost-effective NTD treatments/ vaccine research, changes to global trade rules (potentially) and more obscure lobbies such as “Keep Antibiotics Working”. We’ve a great deal of respect for your work and the superbly rational way you go about it, and

3 0.95941699 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?


4 0.9573518 1292 andrew gelman stats-2012-05-01-Colorless green facts asserted resolutely

Introduction: Thomas Basbøll [yes, I've learned how to smoothly do this using alt-o] gives some writing advice : What gives a text presence is our commitment to asserting facts. We have to face the possibility that we may be wrong about them resolutely, and we do this by writing about them as though we are right. This and an earlier remark by Basbøll are closely related in my mind to predictive model checking and to Bayesian statistics : we make strong assumptions and then engage the data and the assumptions in a dialogue: assumptions + data -> inference, and we can then compare the inference to the data, which can reveal problems with our model (or problems with the data, but that’s really problems with the model too, in this case problems with the model for the data). I like the idea that a condition for a story to be useful is that we put some belief into it. (One doesn’t put belief into a joke.) And also the converse, that thinking hard about a story and believing it can be the pre

5 0.95045638 1691 andrew gelman stats-2013-01-25-Extreem p-values!

Introduction: Joshua Vogelstein writes: I know you’ve discussed this on your blog in the past, but I don’t know exactly how you’d answer the following query: Suppose you run an analysis and obtain a p-value of 10^-300. What would you actually report? I’m fairly confident that I’m not that confident :) I’m guessing: “p-value \approx 0.” One possibility is to determine the accuracy with this one *could* in theory know, by virtue of the sample size, and say that p-value is less than or equal to that? For example, if I used a Monte Carlo approach to generate the null distribution with 10,000 samples, and I found that the observed value was more extreme than all of the sample values, then I might say that p is less than or equal to 1/10,000. My reply: Mosteller and Wallace talked a bit about this in their book, the idea that there are various other 1-in-a-million possibilities (for example, the data were faked somewhere before they got to you) so p-values such as 10^-6 don’t really mean an
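The "p is less than or equal to 1/10,000" idea Vogelstein describes corresponds to a standard Monte Carlo p-value estimator with the (r + 1) / (n + 1) correction (a sketch of that convention, not code from the post): with n null draws it can never return a value below 1/(n + 1), so it reports a bound rather than an absurdly tiny number.

```python
import random

def mc_p_value(observed, null_samples):
    """Monte Carlo p-value with the (r + 1) / (n + 1) correction, where r
    counts null draws at least as extreme as the observed statistic out of
    n draws. The smallest reportable value is 1 / (n + 1), never zero."""
    r = sum(1 for s in null_samples if s >= observed)
    return (r + 1) / (len(null_samples) + 1)

rng = random.Random(0)
null = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
# An observed value more extreme than every null draw: report p <= 1/10001.
print(mc_p_value(99.0, null))
```

This honestly reflects the resolution of the simulation, which is in the same spirit as the Mosteller and Wallace point: beyond some threshold, other one-in-a-million possibilities dominate and the nominal p-value stops being meaningful.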

6 0.9490571 588 andrew gelman stats-2011-02-24-In case you were wondering, here’s the price of milk

7 0.94168758 698 andrew gelman stats-2011-05-05-Shocking but not surprising

8 0.9416247 1319 andrew gelman stats-2012-05-14-I hate to get all Gerd Gigerenzer on you here, but . . .

9 0.93907124 114 andrew gelman stats-2010-06-28-More on Bayesian deduction-induction

10 0.93242371 1074 andrew gelman stats-2011-12-20-Reading a research paper != agreeing with its claims

11 0.92881417 718 andrew gelman stats-2011-05-18-Should kids be able to bring their own lunches to school?

12 0.92466712 2338 andrew gelman stats-2014-05-19-My short career as a Freud expert

13 0.92190307 2136 andrew gelman stats-2013-12-16-Whither the “bet on sparsity principle” in a nonsparse world?

14 0.92158568 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals

15 0.91948265 456 andrew gelman stats-2010-12-07-The red-state, blue-state war is happening in the upper half of the income distribution

16 0.91749233 815 andrew gelman stats-2011-07-22-Statistical inference based on the minimum description length principle

17 0.9132995 247 andrew gelman stats-2010-09-01-How does Bayes do it?

18 0.91309613 621 andrew gelman stats-2011-03-20-Maybe a great idea in theory, didn’t work so well in practice

19 0.91178566 1732 andrew gelman stats-2013-02-22-Evaluating the impacts of welfare reform?

20 0.91091657 2148 andrew gelman stats-2013-12-25-Spam!