andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1760 knowledge-graph by maker-knowledge-mining

1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value


meta info for this blog

Source: html

Introduction: The New York Times has a feature in its Tuesday science section, Take a Number, to which I occasionally contribute (see here and here). Today’s column, by Nicholas Bakalar, is in error. The column begins: When medical researchers report their findings, they need to know whether their result is a real effect of what they are testing, or just a random occurrence. To figure this out, they most commonly use the p-value. This is wrong on two counts. First, whatever researchers might feel, this is something they’ll never know. Second, results are a combination of real effects and chance, it’s not either/or. Perhaps the above is a forgivable simplification, but I don’t think so; I think it’s a simplification that destroys the reason for writing the article in the first place. But in any case I think there’s no excuse for this, later on: By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.
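A quick way to see why that last quoted claim fails: Pr(p > 0.05 | real effect) is often large, because many studies have modest power. Below is a minimal simulation sketch in Python; the sample size, effect size, and choice of a one-sample t-test are illustrative assumptions, not anything taken from the column or the post.

    # Hedged sketch: how often a study of a REAL effect still gives p > 0.05.
    # n, the effect size, and the one-sample t-test are assumptions for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, effect, alpha = 25, 0.4, 0.05      # assumed sample size and true effect (in sd units)
    n_sims = 10_000

    misses = 0
    for _ in range(n_sims):
        y = rng.normal(loc=effect, scale=1.0, size=n)   # every simulated dataset has a real effect
        _, p = stats.ttest_1samp(y, popmean=0.0)
        if p > alpha:
            misses += 1

    print(f"Share of real-effect studies with p > 0.05: {misses / n_sims:.2f}")

With these assumed numbers, roughly half of the simulated studies come out non-significant even though the effect is real in every one of them, so p > 0.05 cannot mean the results were "probably due only to chance."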


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The New York Times has a feature in its Tuesday science section, Take a Number, to which I occasionally contribute (see here and here ). [sent-1, score-0.171]

2 Today’s column, by Nicholas Bakalar, is in error. [sent-2, score-0.215]

3 The column begins: When medical researchers report their findings, they need to know whether their result is a real effect of what they are testing, or just a random occurrence. [sent-3, score-0.565]

4 To figure this out, they most commonly use the p-value. [sent-4, score-0.09]

5 First, whatever researchers might feel, this is something they’ll never know. [sent-6, score-0.087]

6 Second, results are a combination of real effects and chance, it’s not either/or. [sent-7, score-0.2]

7 Perhaps the above is a forgivable simplification, but I don’t think so; I think it’s a simplification that destroys the reason for writing the article in the first place. [sent-8, score-0.384]

8 But in any case I think there’s no excuse for this, later on: By convention, a p-value higher than 0.05 [sent-9, score-0.096]

9 usually indicates that the results of the study, however good or bad, were probably due only to chance. [sent-10, score-0.405]

10 This is the old, old error of confusing p(A|B) with p(B|A). [sent-11, score-0.301]

11 I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. [sent-12, score-0.309]

12 The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). [sent-14, score-1.379]

13 I can’t get too annoyed at science writer Bakalar for garbling the point—it confuses lots and lots of people—but, still, I hate to see this error in the newspaper. [sent-18, score-0.486]

14 On the plus side, if a newspaper column runs 20 times, I guess it’s ok for it to be wrong once—we still have 95% confidence in it, right? [sent-19, score-0.431]

15 Various commenters remark that it’s not so easy to define p-values accurately. [sent-22, score-0.083]

16 I agree, and I think it’s for reasons described in my quote immediately above: the formal view of the p-value is mathematically correct but typically irrelevant to research goals. [sent-23, score-1.003]

17 Phil nails it: The p-value does not tell you if the result was due to chance. [sent-27, score-0.417]

18 It tells you whether the results are consistent with being due to chance. [sent-28, score-0.401]
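Item 10 above is the crux, and items 17 and 18 state the correct reading. Here is a minimal numeric sketch of the p(A|B)-versus-p(B|A) distinction in Python; the base rate of true nulls and the studies' power are made-up assumptions, not figures from the post.

    # Hedged sketch of why Pr(p < 0.05 | null) and Pr(null | p < 0.05) differ.
    # prior_null and power are assumed numbers chosen only for illustration.
    prior_null = 0.9   # assumed share of tested hypotheses with no real effect
    power = 0.8        # assumed Pr(p < 0.05 | real effect)
    alpha = 0.05       # Pr(p < 0.05 | null): the quantity the 0.05 convention controls

    p_sig = alpha * prior_null + power * (1 - prior_null)   # Pr(p < 0.05) overall
    p_null_given_sig = alpha * prior_null / p_sig           # Pr(null | p < 0.05), by Bayes' rule

    print(f"Pr(p < 0.05 | null) = {alpha:.2f}")             # 0.05 by construction
    print(f"Pr(null | p < 0.05) = {p_null_given_sig:.2f}")  # about 0.36 under these assumptions

The threshold fixes Pr(p < 0.05 | null) at 0.05, but the quantity readers usually want, Pr(null | p < 0.05), comes out near 0.36 under these assumed numbers: the p-value tells you whether the data are consistent with chance, not the probability that the result was due to chance.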


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('simplification', 0.249), ('column', 0.215), ('due', 0.2), ('irrelevant', 0.187), ('mathematically', 0.185), ('view', 0.172), ('null', 0.161), ('formal', 0.16), ('begins', 0.157), ('rushed', 0.135), ('destroys', 0.135), ('wrong', 0.133), ('confuses', 0.129), ('results', 0.119), ('nails', 0.117), ('correct', 0.113), ('persists', 0.112), ('typically', 0.109), ('nicholas', 0.108), ('old', 0.108), ('greenland', 0.107), ('convention', 0.105), ('value', 0.104), ('result', 0.1), ('error', 0.1), ('tuesday', 0.098), ('popularity', 0.098), ('misunderstanding', 0.098), ('excuse', 0.096), ('confusing', 0.093), ('interpretations', 0.093), ('times', 0.091), ('probability', 0.09), ('introductory', 0.09), ('commonly', 0.09), ('contribute', 0.09), ('casual', 0.09), ('annoyed', 0.089), ('researchers', 0.087), ('indicates', 0.086), ('textbook', 0.084), ('lots', 0.084), ('remark', 0.083), ('runs', 0.083), ('whether', 0.082), ('real', 0.081), ('occasionally', 0.081), ('immediately', 0.077), ('phil', 0.077), ('valid', 0.077)]
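The (word, weight) pairs above and the simValue scores below are tfidf outputs. The mining pipeline that produced them is not included in this page; the following is only a hypothetical sketch, using scikit-learn and placeholder excerpts in place of the full blog posts, of how such weights and document similarities could be computed.

    # Hypothetical sketch of tfidf weights and cosine similarities; this is NOT
    # the maker-knowledge-mining pipeline, just one standard way to get such numbers.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The New York Times has a feature in its Tuesday science section",
        "Sander Greenland and Charles Poole accept that P values are here to stay",
        "Despite the myriad rules and procedures of science, some research findings are pure flukes",
    ]  # placeholder excerpts standing in for the full posts

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)                 # rows = documents, columns = word tfidf weights

    terms = vectorizer.get_feature_names_out()
    weights = X[0].toarray().ravel()
    top = sorted(zip(terms, weights), key=lambda t: -t[1])[:10]
    print(top)                                         # analogous to the (wordName, wordTfidf) list above

    print(cosine_similarity(X))                        # analogous to the simValue column below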

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value


2 0.28582358 1713 andrew gelman stats-2013-02-08-P-values and statistical practice

Introduction: From my new article in the journal Epidemiology: Sander Greenland and Charles Poole accept that P values are here to stay but recognize that some of their most common interpretations have problems. The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). A Bayesian interpretation based on a spike-and-slab model makes little sense in applied contexts in epidemiology, political science, and other fields in which true effects are typically nonzero and bounded (thus violating both the “spike” and the “slab” parts of the model). I find Greenland and Poole’s perspective t

3 0.18908873 256 andrew gelman stats-2010-09-04-Noooooooooooooooooooooooooooooooooooooooooooooooo!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Introduction: Masanao sends this one in, under the heading, “another incident of misunderstood p-value”: Warren Davies, a positive psychology MSc student at UEL, provides the latest in our ongoing series of guest features for students. Warren has just released a Psychology Study Guide, which covers information on statistics, research methods and study skills for psychology students. Despite the myriad rules and procedures of science, some research findings are pure flukes. Perhaps you’re testing a new drug, and by chance alone, a large number of people spontaneously get better. The better your study is conducted, the lower the chance that your result was a fluke – but still, there is always a certain probability that it was. Statistical significance testing gives you an idea of what this probability is. In science we’re always testing hypotheses. We never conduct a study to ‘see what happens’, because there’s always at least one way to make any useless set of data look important. We take

4 0.14880459 2295 andrew gelman stats-2014-04-18-One-tailed or two-tailed?

Introduction: Someone writes: Suppose I have two groups of people, A and B, which differ on some characteristic of interest to me; and for each person I measure a single real-valued quantity X. I have a theory that group A has a higher mean value of X than group B. I test this theory by using a t-test. Am I entitled to use a *one-tailed* t-test? Or should I use a *two-tailed* one (thereby giving a p-value that is twice as large)? I know you will probably answer: Forget the t-test; you should use Bayesian methods instead. But what is the standard frequentist answer to this question? My reply: The quick answer here is that different people will do different things here. I would say the 2-tailed p-value is more standard but some people will insist on the one-tailed version, and it’s hard to make a big stand on this one, given all the other problems with p-values in practice: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf http://www.stat.columbia.edu/~gelm

5 0.14747837 2107 andrew gelman stats-2013-11-20-NYT (non)-retraction watch

Introduction: Mark Palko is irritated by the Times’s refusal to retract a recounting of a hoax regarding Dickens and Dostoevsky. All I can say is, the Times refuses to retract mistakes of fact that are far more current than that! See here for two examples that particularly annoyed me, to the extent that I contacted various people at the Times but ran into refusals to retract. I guess a daily newspaper publishes so much material that they can’t be expected to run a retraction every time they publish something false, even when such things are brought to their attention. Speaking of corrections, I wonder if later editions of the Samuelson economics textbook discussed their notorious graph predicting Soviet economic performance. The easiest thing would be just to remove the graph, but I think it would be a better economics lesson to discuss the error! Similarly, I think the NYT would do well to run an article on their Dickens-Dostoevsky mistake, along with a column by Arthur Brooks on how

6 0.14020213 1826 andrew gelman stats-2013-04-26-“A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null”

7 0.13289271 2127 andrew gelman stats-2013-12-08-The never-ending (and often productive) race between theory and practice

8 0.13244548 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?

9 0.11491492 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards

10 0.11400396 1791 andrew gelman stats-2013-04-07-Scatterplot charades!

11 0.11372195 1117 andrew gelman stats-2012-01-13-What are the important issues in ethics and statistics? I’m looking for your input!

12 0.11348516 2281 andrew gelman stats-2014-04-04-The Notorious N.H.S.T. presents: Mo P-values Mo Problems

13 0.11296377 506 andrew gelman stats-2011-01-06-That silly ESP paper and some silliness in a rebuttal as well

14 0.1087053 1878 andrew gelman stats-2013-05-31-How to fix the tabloids? Toward replicable social science research

15 0.10807255 2029 andrew gelman stats-2013-09-18-Understanding posterior p-values

16 0.10491271 1355 andrew gelman stats-2012-05-31-Lindley’s paradox

17 0.10453768 1605 andrew gelman stats-2012-12-04-Write This Book

18 0.1044345 2183 andrew gelman stats-2014-01-23-Discussion on preregistration of research studies

19 0.1036144 291 andrew gelman stats-2010-09-22-Philosophy of Bayes and non-Bayes: A dialogue with Deborah Mayo

20 0.10323818 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.233), (1, 0.01), (2, -0.004), (3, -0.094), (4, -0.055), (5, -0.057), (6, 0.03), (7, 0.023), (8, 0.034), (9, -0.085), (10, -0.069), (11, 0.033), (12, -0.005), (13, -0.068), (14, -0.032), (15, -0.009), (16, -0.039), (17, -0.025), (18, 0.014), (19, -0.045), (20, 0.054), (21, -0.01), (22, 0.008), (23, -0.008), (24, -0.055), (25, -0.01), (26, -0.012), (27, 0.059), (28, -0.001), (29, -0.059), (30, -0.007), (31, 0.006), (32, 0.006), (33, -0.014), (34, -0.029), (35, -0.048), (36, 0.06), (37, -0.036), (38, 0.022), (39, -0.081), (40, -0.017), (41, -0.05), (42, 0.025), (43, -0.008), (44, 0.01), (45, 0.052), (46, -0.018), (47, 0.002), (48, 0.042), (49, 0.024)]
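The 50 (topicId, topicWeight) pairs above are the document’s coordinates in an LSI (latent semantic indexing) space. The source does not say which library or settings were used; as a hypothetical sketch, LSI-style topic weights can be obtained by a truncated SVD of the tfidf matrix (gensim’s LsiModel would be an equally natural choice).

    # Hypothetical LSI sketch: truncated SVD of a tfidf matrix gives each document
    # a short vector of topic weights.  Placeholder texts and topic count are assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "placeholder text about p values and statistical practice",
        "another placeholder text about newspapers and science reporting",
        "a third placeholder text about bayesian inference and epidemiology",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    lsi = TruncatedSVD(n_components=2, random_state=0)   # the list above suggests ~50 topics were used
    Z = lsi.fit_transform(X)                             # rows = documents, columns = topic weights

    print(Z[0])                   # topic-weight vector for the first document, like the list above
    print(cosine_similarity(Z))   # similarities in the reduced topic space, like the simValue column below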

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97125947 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value


2 0.82443053 1883 andrew gelman stats-2013-06-04-Interrogating p-values

Introduction: This article is a discussion of a paper by Greg Francis for a special issue, edited by E. J. Wagenmakers, of the Journal of Mathematical Psychology. Here’s what I wrote: Much of statistical practice is an effort to reduce or deny variation and uncertainty. The reduction is done through standardization, replication, and other practices of experimental design, with the idea being to isolate and stabilize the quantity being estimated and then average over many cases. Even so, however, uncertainty persists, and statistical hypothesis testing is in many ways an endeavor to deny this, by reporting binary accept/reject decisions. Classical statistical methods produce binary statements, but there is no reason to assume that the world works that way. Expressions such as Type 1 error, Type 2 error, false positive, and so on, are based on a model in which the world is divided into real and non-real effects. To put it another way, I understand the general scientific distinction of real vs

3 0.80890763 2102 andrew gelman stats-2013-11-15-“Are all significant p-values created equal?”

Introduction: The answer is no, as explained in this classic article by Warren Browner and Thomas Newman from 1987. If I were to rewrite this article today, I would frame things slightly differently—referring to Type S and Type M errors rather than speaking of “the probability that the research hypothesis is true”—but overall they make good points, and I like their analogy to medical diagnostic testing.

4 0.79953301 256 andrew gelman stats-2010-09-04-Noooooooooooooooooooooooooooooooooooooooooooooooo!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Introduction: Masanao sends this one in, under the heading, “another incident of misunderstood p-value”: Warren Davies, a positive psychology MSc student at UEL, provides the latest in our ongoing series of guest features for students. Warren has just released a Psychology Study Guide, which covers information on statistics, research methods and study skills for psychology students. Despite the myriad rules and procedures of science, some research findings are pure flukes. Perhaps you’re testing a new drug, and by chance alone, a large number of people spontaneously get better. The better your study is conducted, the lower the chance that your result was a fluke – but still, there is always a certain probability that it was. Statistical significance testing gives you an idea of what this probability is. In science we’re always testing hypotheses. We never conduct a study to ‘see what happens’, because there’s always at least one way to make any useless set of data look important. We take

5 0.7958675 2281 andrew gelman stats-2014-04-04-The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Introduction: A recent discussion between commenters Question and Fernando captured one of the recurrent themes here from the past year. Question: The problem is simple, the researchers are disproving always false null hypotheses and taking this disproof as near proof that their theory is correct. Fernando: Whereas it is probably true that researchers misuse NHT, the problem with tabloid science is broader and deeper. It is systemic. Question: I do not see how anything can be deeper than replacing careful description, prediction, falsification, and independent replication with dynamite plots, p-values, affirming the consequent, and peer review. From my own experience I am confident in saying that confusion caused by NHST is at the root of this problem. Fernando: Incentives? Impact factors? Publish or die? “Interesting” and “new” above quality and reliability, or actually answering a research question, and a silly and unbecoming obsession with being quoted in NYT, etc. . . . Giv

6 0.78872687 1826 andrew gelman stats-2013-04-26-“A Vast Graveyard of Undead Theories: Publication Bias and Psychological Science’s Aversion to the Null”

7 0.7760365 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards

8 0.77518296 1713 andrew gelman stats-2013-02-08-P-values and statistical practice

9 0.76453125 1355 andrew gelman stats-2012-05-31-Lindley’s paradox

10 0.74544001 2140 andrew gelman stats-2013-12-19-Revised evidence for statistical standards

11 0.74475259 2040 andrew gelman stats-2013-09-26-Difficulties in making inferences about scientific truth from distributions of published p-values

12 0.7370798 2093 andrew gelman stats-2013-11-07-I’m negative on the expression “false positives”

13 0.73477376 1195 andrew gelman stats-2012-03-04-Multiple comparisons dispute in the tabloids

14 0.7331847 506 andrew gelman stats-2011-01-06-That silly ESP paper and some silliness in a rebuttal as well

15 0.72941864 2243 andrew gelman stats-2014-03-11-The myth of the myth of the myth of the hot hand

16 0.72630972 2272 andrew gelman stats-2014-03-29-I agree with this comment

17 0.7237556 1861 andrew gelman stats-2013-05-17-Where do theories come from?

18 0.71692801 2183 andrew gelman stats-2014-01-23-Discussion on preregistration of research studies

19 0.71401644 2295 andrew gelman stats-2014-04-18-One-tailed or two-tailed?

20 0.71179521 54 andrew gelman stats-2010-05-27-Hype about conditional probability puzzles


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(15, 0.015), (16, 0.124), (18, 0.014), (21, 0.033), (24, 0.206), (42, 0.05), (53, 0.024), (86, 0.037), (89, 0.015), (94, 0.066), (99, 0.303)]
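The (topicId, topicWeight) pairs above are per-document topic proportions from an LDA model. The number of topics, preprocessing, and library behind these numbers are not given in the source; the following is only a hypothetical sketch with scikit-learn.

    # Hypothetical LDA sketch: fit topics on word counts and read off each document's
    # topic proportions.  Placeholder texts and n_components are assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "placeholder text about p values and statistical practice",
        "another placeholder text about newspapers and science reporting",
        "a third placeholder text about bayesian inference and epidemiology",
    ]

    counts = CountVectorizer(stop_words="english").fit_transform(docs)   # LDA works on raw counts
    lda = LatentDirichletAllocation(n_components=5, random_state=0)      # assumed topic count
    theta = lda.fit_transform(counts)        # rows = documents, columns = topic proportions

    print(theta[0])                          # topic-weight vector for the first document, like the list above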

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97652936 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value


2 0.96865487 2179 andrew gelman stats-2014-01-20-The AAA Tranche of Subprime Science

Introduction: In our new ethics column for Chance , Eric Loken and I write about our current favorite topic: One of our ongoing themes when discussing scientific ethics is the central role of statistics in recognizing and communicating uncer- tainty. Unfortunately, statistics—and the scientific process more generally—often seems to be used more as a way of laundering uncertainty, processing data until researchers and consumers of research can feel safe acting as if various scientific hypotheses are unquestionably true. . . . We have in mind an analogy with the notorious AAA-class bonds created during the mid-2000s that led to the subprime mortgage crisis. Lower-quality mortgages—that is, mortgages with high probability of default and, thus, high uncertainty—were packaged and transformed into financial instruments that were (in retrospect, falsely) characterized as low risk. There was a tremendous interest in these securities, not just among the most unscrupulous market manipulators, but in a

3 0.9658339 1881 andrew gelman stats-2013-06-03-Boot

Introduction: Joshua Hartshorne writes: I ran several large-N experiments (separate participants) and looked at performance against age. What we want to do is compare age-of-peak-performance across the different tasks (again, different participants). We bootstrapped age-of-peak-performance. On each iteration, we sampled (with replacement) the X scores at each age, where X=num of participants at that age, and recorded the age at which performance peaked on that task. We then recorded the age at which performance was at peak and repeated. Once we had distributions of age-of-peak-performance, we used the means and SDs to calculate t-statistics to compare the results across different tasks. For graphical presentation, we used medians, interquartile ranges, and 95% confidence intervals (based on the distributions: the range within which 75% and 95% of the bootstrapped peaks appeared). While a number of people we consulted with thought this made a lot of sense, one reviewer of the paper insist

4 0.96539766 807 andrew gelman stats-2011-07-17-Macro causality

Introduction: David Backus writes: This is from my area of work, macroeconomics. The suggestion here is that the economy is growing slowly because consumers aren’t spending money. But how do we know it’s not the reverse: that consumers are spending less because the economy isn’t doing well. As a teacher, I can tell you that it’s almost impossible to get students to understand that the first statement isn’t obviously true. What I’d call the demand-side story (more spending leads to more output) is everywhere, including this piece, from the usually reliable David Leonhardt. This whole situation reminds me of the story of the village whose inhabitants support themselves by taking in each others’ laundry. I guess we’re rich enough in the U.S. that we can stay afloat for a few decades just buying things from each other? Regarding the causal question, I’d like to move away from the idea of “Does A causes B or does B cause A” and toward a more intervention-based framework (Rubin’s model for

5 0.96458268 2040 andrew gelman stats-2013-09-26-Difficulties in making inferences about scientific truth from distributions of published p-values

Introduction: Jeff Leek just posted the discussions of his paper (with Leah Jager), “An estimate of the science-wise false discovery rate and application to the top medical literature,” along with some further comments of his own. Here are my original thoughts on an earlier version of their article. Keith O’Rourke and I expanded these thoughts into a formal comment for the journal. We’re pretty much in agreement with John Ioannidis (you can find his discussion in the top link above). In quick summary, I agree with Jager and Leek that this is an important topic. I think there are two key places where Keith and I disagree with them: 1. They take published p-values at face value whereas we consider them as the result of a complicated process of selection. This is something I didn’t used to think much about, but now I’ve become increasingly convinced that the problems with published p-values is not a simple file-drawer effect or the case of a few p=0.051 values nudged toward p=0.049, bu

6 0.96345127 898 andrew gelman stats-2011-09-10-Fourteen magic words: an update

7 0.9627341 447 andrew gelman stats-2010-12-03-Reinventing the wheel, only more so.

8 0.96214813 799 andrew gelman stats-2011-07-13-Hypothesis testing with multiple imputations

9 0.9614011 1206 andrew gelman stats-2012-03-10-95% intervals that I don’t believe, because they’re from a flat prior I don’t believe

10 0.9608084 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards

11 0.96079481 1019 andrew gelman stats-2011-11-19-Validation of Software for Bayesian Models Using Posterior Quantiles

12 0.9602114 2201 andrew gelman stats-2014-02-06-Bootstrap averaging: Examples where it works and where it doesn’t work

13 0.9597562 586 andrew gelman stats-2011-02-23-A statistical version of Arrow’s paradox

14 0.95874798 615 andrew gelman stats-2011-03-16-Chess vs. checkers

15 0.95776469 488 andrew gelman stats-2010-12-27-Graph of the year

16 0.9576 503 andrew gelman stats-2011-01-04-Clarity on my email policy

17 0.95736086 1422 andrew gelman stats-2012-07-20-Likelihood thresholds and decisions

18 0.95724618 1218 andrew gelman stats-2012-03-18-Check your missing-data imputations using cross-validation

19 0.95718241 899 andrew gelman stats-2011-09-10-The statistical significance filter

20 0.95705545 639 andrew gelman stats-2011-03-31-Bayes: radical, liberal, or conservative?