andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-602 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Jeff Witmer writes: I noticed that you continue the standard practice in statistics of referring to assumptions; e.g. a blog entry on 2/4/11 at 10:54: “Our method, just like any model, relies on assumptions which we have the duty to state and to check.” I’m in the 6th year of a three-year campaign to get statisticians to drop the word “assumptions” and replace it with “conditions.” The problem, as I see it, is that people tend to think that an assumption is something that one assumes, as in “assuming that we have a right triangle…” or “assuming that k is even…” when constructing a mathematical proof. But in statistics we don’t assume things — unless we have to. Instead, we know that, for example, the validity of a t-test depends on normality, which is a condition that can and should be checked. Let’s not call normality an assumption, lest we imply that it is something that can be assumed. Let’s call it a condition. What do you all think?
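As a rough illustration of treating normality as a checkable condition, here is a minimal sketch in R; the simulated sample y and the particular checks are my own choices, not from the post:

set.seed(602)
y <- rexp(40)                    # hypothetical sample, deliberately skewed
qqnorm(y); qqline(y)             # graphical check of the normality condition
shapiro.test(y)                  # formal check (Shapiro-Wilk)
# If the condition looks doubtful, a transformation or a rank-based test such as
# wilcox.test(y, mu = 1) may be more defensible than t.test(y, mu = 1).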
sentIndex sentText sentNum sentScore
1 Jeff Witmer writes: I noticed that you continue the standard practice in statistics of referring to assumptions; e. [sent-1, score-0.615]
2 a blog entry on 2/4/11 at 10:54: “Our method, just like any model, relies on assumptions which we have the duty to state and to check. [sent-3, score-0.874]
3 ” I’m in the 6th year of a three-year campaign to get statisticians to drop the word “assumptions” and replace it with “conditions. [sent-4, score-0.714]
4 ” The problem, as I see it, is that people tend to think that an assumption is something that one assumes, as in “assuming that we have a right triangle…” or “assuming that k is even…” when constructing a mathematical proof. [sent-5, score-0.788]
5 But in statistics we don’t assume things — unless we have to. [sent-6, score-0.343]
6 Instead, we know that, for example, the validity of a t-test depends on normality, which is a condition that can and should be checked. [sent-7, score-0.432]
7 Let’s not call normality an assumption, lest we imply that it is something that can be assumed. [sent-8, score-1.016]
wordName wordTfidf (topN-words)
[('normality', 0.403), ('assumptions', 0.303), ('assumption', 0.232), ('witmer', 0.231), ('assuming', 0.222), ('triangle', 0.218), ('lest', 0.218), ('assumes', 0.182), ('call', 0.173), ('relies', 0.165), ('duty', 0.161), ('constructing', 0.159), ('condition', 0.145), ('replace', 0.137), ('drop', 0.136), ('campaign', 0.136), ('referring', 0.136), ('imply', 0.132), ('validity', 0.131), ('let', 0.13), ('depends', 0.119), ('jeff', 0.119), ('entry', 0.117), ('unless', 0.111), ('continue', 0.11), ('noticed', 0.108), ('word', 0.106), ('mathematical', 0.105), ('tend', 0.1), ('statistics', 0.097), ('statisticians', 0.094), ('practice', 0.091), ('something', 0.09), ('method', 0.084), ('assume', 0.083), ('state', 0.077), ('instead', 0.075), ('standard', 0.073), ('year', 0.072), ('right', 0.052), ('things', 0.052), ('blog', 0.051), ('problem', 0.051), ('think', 0.05), ('model', 0.046), ('example', 0.038), ('know', 0.037), ('even', 0.035), ('writes', 0.035), ('get', 0.033)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions
2 0.30088347 1967 andrew gelman stats-2013-08-04-What are the key assumptions of linear regression?
Introduction: Andy Cooper writes: A link to an article, “Four Assumptions Of Multiple Regression That Researchers Should Always Test”, has been making the rounds on Twitter. Their first rule is “Variables are Normally distributed.” And they seem to be talking about the independent variables – but then later bring in tests on the residuals (while admitting that the normally-distributed error assumption is a weak assumption). I thought we had long since moved away from transforming our independent variables to make them normally distributed for statistical reasons (as opposed to standardizing them for interpretability, etc.) Am I missing something? I agree that leverage and influence are important, but normality of the variables? The article is from 2002, so it might be dated, but given the popularity of the tweet, I thought I’d ask your opinion. My response: There’s some useful advice on that page but overall I think the advice was dated even in 2002. In section 3.6 of my book wit
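A brief R sketch of the distinction this snippet draws, with simulated data (the model and variable names are hypothetical): a skewed predictor is not a problem in itself, and the classical normality story, to the extent it matters, concerns the residuals.

set.seed(1967)
x <- rexp(200)                           # a skewed predictor; not a problem in itself
y <- 1 + 2*x + rnorm(200, sd = 0.5)      # the errors are what the classical theory refers to
fit <- lm(y ~ x)
hist(x)                                  # skewed, and that is fine
qqnorm(resid(fit)); qqline(resid(fit))   # the residual check, if one checks at all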
3 0.18047208 2046 andrew gelman stats-2013-10-01-I’ll say it again
Introduction: Milan Valasek writes: Psychology students (and probably students in other disciplines) are often taught that in order to perform ‘parametric’ tests, e.g. an independent t-test, the data for each group need to be normally distributed. However, in the literature (and various university lecture notes and slides accessible online), I have come across at least 4 different interpretations of what it is that is supposed to be normally distributed when doing a t-test:
1. population
2. sampled data for each group
3. distribution of estimates of means for each group
4. distribution of estimates of the difference between groups
I can see how 2 would follow from 1 and 4 from 3 but even then, there are two different sets of interpretations of the normality assumption. Could you please put this issue to rest for me? My quick response is that normality is not so important unless you are focusing on prediction.
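A small simulation in R of the distinctions in that list (the numbers are made up): even when the population (interpretation 1) is strongly skewed, the distribution of the estimated difference in means (interpretation 4) is close to normal with 50 observations per group.

set.seed(2046)
diffs <- replicate(5000, mean(rexp(50)) - mean(rexp(50)))
hist(diffs, breaks = 50)       # roughly symmetric and bell-shaped
qqnorm(diffs); qqline(diffs)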
4 0.17499557 603 andrew gelman stats-2011-03-07-Assumptions vs. conditions, part 2
Introduction: In response to the discussion of his remarks on assumptions vs. conditions, Jeff Witmer writes: If [certain conditions hold], then the t-test p-value gives a remarkably good approximation to “the real thing” — namely the randomization reference p-value. . . . I [Witmer] make assumptions about conditions that I cannot check, e.g., that the data arose from a random sample. Of course, just as there is no such thing as a normal population, there is no such thing as a random sample. I disagree strongly with both the above paragraphs! I say this not to pick a fight with Jeff Witmer but to illustrate how, in statistics, even the most basic points that people take for granted, can’t be. Let’s take the claims in order: 1. The purpose of a t test is to approximate the randomization p-value. Not to me. In my world, the purpose of t tests and intervals is to summarize uncertainty in estimates and comparisons. I don’t care about a p-value and almost certainly don’t care a
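For reference, a minimal R sketch of the comparison Witmer describes, using a simple permutation of group labels on simulated data (group sizes and effect size are arbitrary), putting the randomization reference p-value next to the t-test p-value:

set.seed(603)
y1 <- rnorm(20, 0.3); y2 <- rnorm(20, 0)
obs <- mean(y1) - mean(y2)
pooled <- c(y1, y2)
perm <- replicate(10000, { idx <- sample(40, 20); mean(pooled[idx]) - mean(pooled[-idx]) })
mean(abs(perm) >= abs(obs))    # randomization reference p-value
t.test(y1, y2)$p.value         # t-test p-value, typically close to the line above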
Introduction: Elias Bareinboim asked what I thought about his comment on selection bias in which he referred to a paper by himself and Judea Pearl, “Controlling Selection Bias in Causal Inference.” I replied that I have no problem with what he wrote, but that from my perspective I find it easier to conceptualize such problems in terms of multilevel models. I elaborated on that point in a recent post, “Hierarchical modeling as a framework for extrapolation,” which I think was read by only a few people (I say this because it received only two comments). I don’t think Bareinboim objected to anything I wrote, but like me he is comfortable working within his own framework. He wrote the following to me: In some sense, “not ad hoc” could mean logically consistent. In other words, if one agrees with the assumptions encoded in the model, one must also agree with the conclusions entailed by these assumptions. I am not aware of any other way of doing mathematics. As it turns out, to get causa
6 0.14993902 554 andrew gelman stats-2011-02-04-An addition to the model-makers’ oath
7 0.12016198 2359 andrew gelman stats-2014-06-04-All the Assumptions That Are My Life
8 0.11988005 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model
10 0.10955173 451 andrew gelman stats-2010-12-05-What do practitioners need to know about regression?
13 0.090967812 299 andrew gelman stats-2010-09-27-what is = what “should be” ??
14 0.089029811 1004 andrew gelman stats-2011-11-11-Kaiser Fung on how not to critique models
15 0.086894907 1165 andrew gelman stats-2012-02-13-Philosophy of Bayesian statistics: my reactions to Wasserman
16 0.086586714 2176 andrew gelman stats-2014-01-19-Transformations for non-normal data
17 0.085337788 1469 andrew gelman stats-2012-08-25-Ways of knowing
18 0.079023831 1292 andrew gelman stats-2012-05-01-Colorless green facts asserted resolutely
19 0.075273633 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo
20 0.075101063 318 andrew gelman stats-2010-10-04-U-Haul statistics
topicId topicWeight
[(0, 0.122), (1, 0.026), (2, 0.009), (3, 0.012), (4, -0.017), (5, 0.01), (6, -0.006), (7, 0.044), (8, 0.056), (9, 0.002), (10, 0.005), (11, 0.0), (12, 0.002), (13, -0.0), (14, -0.04), (15, 0.038), (16, -0.033), (17, 0.013), (18, -0.009), (19, -0.001), (20, 0.04), (21, -0.03), (22, -0.002), (23, -0.002), (24, 0.014), (25, 0.062), (26, -0.011), (27, 0.027), (28, -0.008), (29, 0.062), (30, 0.045), (31, 0.022), (32, -0.005), (33, 0.025), (34, -0.023), (35, 0.001), (36, -0.033), (37, 0.003), (38, -0.054), (39, 0.006), (40, 0.067), (41, -0.062), (42, 0.031), (43, -0.026), (44, -0.004), (45, -0.013), (46, -0.015), (47, 0.004), (48, -0.002), (49, 0.013)]
simIndex simValue blogId blogTitle
same-blog 1 0.94780505 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions
2 0.72707349 1165 andrew gelman stats-2012-02-13-Philosophy of Bayesian statistics: my reactions to Wasserman
Introduction: Continuing with my discussion of the articles in the special issue of the journal Rationality, Markets and Morals on the philosophy of Bayesian statistics: Larry Wasserman, “Low Assumptions, High Dimensions”: This article was refreshing to me because it was so different from anything I’ve seen before. Larry works in a statistics department and I work in a statistics department but there’s so little overlap in what we do. Larry and I both work in high dimensions (maybe his dimensions are higher than mine, but a few thousand dimensions seems like a lot to me!), but there the similarity ends. His article is all about using few to no assumptions, while I use assumptions all the time. Here’s an example. Larry writes: P. Laurie Davies (and his co-workers) have written several interesting papers where probability models, at least in the sense that we usually use them, are eliminated. Data are treated as deterministic. One then looks for adequate models rather than true mode
3 0.7227357 603 andrew gelman stats-2011-03-07-Assumptions vs. conditions, part 2
4 0.66957867 1628 andrew gelman stats-2012-12-17-Statistics in a world where nothing is random
Introduction: Rama Ganesan writes: I think I am having an existential crisis. I used to work with animals (rats, mice, gerbils, etc.) Then I started to work in marketing research where we did have some kind of random sampling procedure. So up until a few years ago, I was sort of okay. Now I am teaching marketing research, and I feel like there is no real random sampling anymore. I take pains to get students to understand what random means, and then the whole lot of inferential statistics. Then almost anything they do – the sample is not random. They think I am contradicting myself. They use convenience samples at every turn – for their school work, and the enormous amount of online surveying that gets done. Do you have any suggestions for me? Other than, say, something like this. My reply: Statistics does not require randomness. The three essential elements of statistics are measurement, comparison, and variation. Randomness is one way to supply variation, and it’s one way to model
5 0.66859782 2128 andrew gelman stats-2013-12-09-How to model distributions that have outliers in one direction
Introduction: Shravan writes: I have a problem very similar to the one presented in chapter 6 of BDA, the speed of light example. You use the distribution of the minimum scores from the posterior predictive distribution, show that it’s not realistic given the data, and suggest that an asymmetric contaminated normal distribution or a symmetric long-tailed distribution would be better. How does one use such a distribution? My reply: You can actually use a symmetric long-tailed distribution such as t with low degrees of freedom. One striking feature of symmetric long-tailed distributions is that a small random sample from such a distribution can have outliers on one side or the other and look asymmetric. Just to see this, try the following in R:
par(mfrow=c(3,3), mar=c(1,1,1,1))
for (i in 1:9) hist(rt(100, 2), xlab="", ylab="", main="")
You’ll see some skewed distributions. So that’s the message (which I learned from an offhand comment of Rubin, actually): if you want to model
6 0.66859251 1282 andrew gelman stats-2012-04-26-Bad news about (some) statisticians
8 0.64949751 1509 andrew gelman stats-2012-09-24-Analyzing photon counts
9 0.64387292 1527 andrew gelman stats-2012-10-10-Another reason why you can get good inferences from a bad model
12 0.62344128 1383 andrew gelman stats-2012-06-18-Hierarchical modeling as a framework for extrapolation
13 0.62025321 638 andrew gelman stats-2011-03-30-More on the correlation between statistical and political ideology
15 0.61279595 2115 andrew gelman stats-2013-11-27-Three unblinded mice
16 0.61047792 2258 andrew gelman stats-2014-03-21-Random matrices in the news
17 0.60930669 1645 andrew gelman stats-2012-12-31-Statistical modeling, causal inference, and social science
18 0.60926199 1849 andrew gelman stats-2013-05-09-Same old same old
20 0.60560745 2072 andrew gelman stats-2013-10-21-The future (and past) of statistical sciences
topicId topicWeight
[(16, 0.092), (18, 0.073), (24, 0.054), (35, 0.027), (47, 0.202), (63, 0.02), (86, 0.078), (99, 0.326)]
simIndex simValue blogId blogTitle
1 0.92952681 95 andrew gelman stats-2010-06-17-“Rewarding Strivers: Helping Low-Income Students Succeed in College”
Introduction: Several years ago, I heard about a project at the Educational Testing Service to identify “strivers”: students from disadvantaged backgrounds who did unexpectedly well on the SAT (the college admissions exam formerly known as the “Scholastic Aptitude Test” but apparently now just “the SAT,” in the same way that Exxon is just “Exxon” and that Harry Truman’s middle name is just “S”), at least 200 points above a predicted score based on demographic and neighborhood information. My ETS colleague and I agreed that this was a silly idea: From a statistical point of view, if student A is expected ahead of time to do better than student B, and then they get identical test scores, then you’d expect student A (the non-”striver”) to do better than student B (the “striver”) later on. Just basic statistics: if a student does much better than expected, then probably some of that improvement is noise. The idea of identifying these “strivers” seemed misguided and not the best use of the SAT.
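The regression-to-the-mean point here can be made with a purely hypothetical R simulation (all numbers invented for illustration): students who score far above their demographic prediction do score above it again later, but by less, because part of the original gap was noise.

set.seed(95)
predicted <- rnorm(1e5, 1000, 100)          # score predicted from demographics
ability   <- predicted + rnorm(1e5, 0, 80)  # true ability varies around the prediction
sat1      <- ability + rnorm(1e5, 0, 80)    # first observed score, with test noise
sat2      <- ability + rnorm(1e5, 0, 80)    # a later, equally noisy outcome
strivers  <- (sat1 - predicted) > 200       # scored 200+ points above prediction
mean(sat1[strivers] - predicted[strivers])  # the observed gap
mean(sat2[strivers] - predicted[strivers])  # roughly half as large: the rest was noise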
Introduction: Adam Marcus at Retraction Watch reports on a physicist at the University of Toronto who had this unfortunate thing happen to him: This article has been retracted at the request of the Editor-in-Chief and first and corresponding author. The article was largely a duplication of a paper that had already appeared in ACS Nano, 4 (2010) 3374–3380, http://dx.doi.org/10.1021/nn100335g. The first and the corresponding authors (Kramer and Sargent) would like to apologize for this administrative error on their part . . . “Administrative error” . . . I love that! Is that what the robber says when he knocks over a liquor store and gets caught? As Marcus points out, the two papers have different titles and a different order of authors, which makes it less plausible that this was an administrative mistake (as could happen, for example, if a secretary was given a list of journals to submit the paper to, and accidentally submitted it to the second journal on the list without realizing it
same-blog 3 0.92603952 602 andrew gelman stats-2011-03-06-Assumptions vs. conditions
4 0.92323768 1055 andrew gelman stats-2011-12-13-Data sharing update
Introduction: Fred Oswald reports that Sian Beilock sent him sufficient raw data from her research study to allow him to answer his questions about the large effects that were observed. This sort of collegiality is central to the collective scientific enterprise. The bad news is that IRBs are still getting in the way. Beilock was very helpful but she had to work within the constraints of her IRB, which apparently advised her not to share data—even if de-identified—without getting lots more permissions. Oswald writes: It is a little concerning that the IRB bars the sharing of de-identified data, particularly in light of the specific guidelines of the journal Science, which appears to say that when you submit a study to the journal for publication, you are allowing for the sharing of de-identified data — unless you expressly say otherwise at the point that you submit the paper for consideration. Again, I don’t blame Beilock and Ramirez—they appear to have been as helpful as
5 0.92063016 2275 andrew gelman stats-2014-03-31-Just gave a talk
Introduction: I just gave a talk in Milan. Actually I was sitting at my desk; it was a G+ hangout, which was a bit more convenient for me. The audience was a bunch of astronomers so I figured they could handle a satellite link. . . . Anyway, the talk didn’t go so well. Two reasons: first, it’s just hard to get the connection with the audience without being able to see their faces. Next time I think I’ll try to get several people in the audience to open up their laptops and connect to the hangout, so that I can see a mosaic of faces instead of just a single image from the front of the room. The second problem with the talk was the topic. I asked the people who invited me to choose a topic, and they picked “Can we use Bayesian methods to resolve the current crisis of statistically-significant research findings that don’t hold up?” But I don’t think this was right for this audience. I think that it would’ve been better to give them the Stan talk or the little data talk or the statistic
6 0.91815937 1143 andrew gelman stats-2012-01-29-G+ > Skype
7 0.91127789 1261 andrew gelman stats-2012-04-12-The Naval Research Lab
8 0.90828937 1050 andrew gelman stats-2011-12-10-Presenting at the econ seminar
10 0.89451331 1668 andrew gelman stats-2013-01-11-My talk at the NY data visualization meetup this Monday!
11 0.89446396 2131 andrew gelman stats-2013-12-12-My talk at Leuven, Sat 14 Dec
12 0.89443147 275 andrew gelman stats-2010-09-14-Data visualization at the American Evaluation Association
13 0.87901062 2068 andrew gelman stats-2013-10-18-G+ hangout for Bayesian Data Analysis course now! (actually, in 5 minutes)
14 0.87822366 1349 andrew gelman stats-2012-05-28-Question 18 of my final exam for Design and Analysis of Sample Surveys
15 0.87641454 2100 andrew gelman stats-2013-11-14-BDA class G+ hangout another try
16 0.87631309 716 andrew gelman stats-2011-05-17-Is the internet causing half the rapes in Norway? I wanna see the scatterplot.
17 0.87351269 438 andrew gelman stats-2010-11-30-I just skyped in from Kentucky, and boy are my arms tired
18 0.87106335 2175 andrew gelman stats-2014-01-18-A course in sample surveys for political science
20 0.86908066 1450 andrew gelman stats-2012-08-08-My upcoming talk for the data visualization meetup