andrew_gelman_stats andrew_gelman_stats-2012 andrew_gelman_stats-2012-1150 knowledge-graph by maker-knowledge-mining

1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals


meta info for this blog

Source: html

Introduction: I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty. In practice, I think we use confidence intervals and hypothesis tests as a way to avoid acknowledging uncertainty. We set up some rules and then act as if we know what is real and what is not. Even in my own applied work, I’ve often enough presented 95% intervals and gone on from there. But maybe that’s just not right. I was thinking about this after receiving the following email from a psychology student: I [the student] am trying to conceptualize the lessons in your paper with Stern on comparing treatment effects across studies. When trying to understand if a certain intervention works, we must look at what the literature says. However, this can be complicated if the literature has divergent results. There are four situations I am thinking of. For each of these situations, assume the studies are r


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 I was thinking about this after receiving the following email from a psychology student: I [the student] am trying to conceptualize the lessons in your paper with Stern on comparing treatment effects across studies. [sent-6, score-0.641]

2 For each of these situations, assume the studies are randomized controlled designs with the same treatment and outcome measures, and each situation refers to a different treatment. [sent-10, score-0.968]

3 In each of these situations only 1 of 2 published studies is found to be statistically significant. [sent-12, score-0.564]

4-6 [Fragments of the student’s comparison table: columns Effect, se, Sig, Sig in diff, Result; the fragments mention Study A, Study D, and Study F, with conclusions such as “Unclear, needs more replications.”] [sent-13, sent-19, sent-23; scores 0.218-0.246]

7 1 Here, Situation 1 refers to 2 studies that have similar effects in magnitude, though the larger of the 2 studies (smaller se) is the only sig one. [sent-29, score-1.054]

8 Since the difference between the two effects is itself not statistically significant, we should conclude the treatment in situation 1 is effective (this seems to be in line with your paper). [sent-30, score-1.209]

9 In situation 2 there are 2 equally sized experiments that differ in treatment effect and significance. [sent-31, score-0.858]

10 Since the difference between the estimates is statistically significant, one concludes the paradigm needs more replications. [sent-32, score-0.348]

11 In situation 3 the 2 studies have 2 effects, one is statistically significant while the other is not. [sent-33, score-0.959]

12 However, in this situation study F is neither statistically nor substantively significant. [sent-34, score-0.945]

13 Unlike situation 1, it would seem unwise to conclude the treatment in situation 3 is effective, and we need more replications. [sent-35, score-1.212]

14 Situation 4 is just some result I came across in a research synthesis, where a smaller study (larger se) had a statistically significant effect, but a larger one did not. [sent-36, score-0.992]

15 It would seem that in this situation the true effect is null and the statistically significant effect is a Type 1 error. [sent-37, score-1.228]

16 However, the difference between the studies is not statistically significant; would this matter? [sent-38, score-0.348]

17 With only two studies, your inference will necessarily depend on your prior information about effectiveness and variation of the treatments. [sent-40, score-0.263]

18 In addition, the hypothetical situations I sent you are sometimes all we know about the effectiveness and variation in treatments, because it is all the evidence we have. [sent-43, score-0.396]

19 What I am trying to better understand is if your paper is addressing situation 1 ONLY, or if it is making inferences or statements about the evidence in the other situations I presented. [sent-44, score-0.885]

20 In a decision problem, I think ultimately it’s necessary to bite the bullet and decide what prior information you have on effectiveness rather than relying on statistical significance. [sent-46, score-0.248]
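
The check that runs through all four situations is the one from the Gelman and Stern paper: compare the two estimates directly and ask whether their difference is itself statistically significant, rather than comparing a "significant" label to a "not significant" one. A minimal sketch in Python, with made-up effect sizes and standard errors standing in for the student's table (the actual numbers are not fully recoverable here):

    import math

    def diff_z(effect_a, se_a, effect_b, se_b):
        """z-score of the difference between two independent estimates."""
        return (effect_a - effect_b) / math.sqrt(se_a**2 + se_b**2)

    def two_sided_p(z):
        """Two-sided p-value from a normal reference distribution."""
        return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    # Hypothetical numbers in the spirit of Situation 1: two studies with similar
    # effects, where only the larger study (smaller se) is individually significant.
    effect_a, se_a = 0.20, 0.08   # study A: z = 2.5, "significant"
    effect_b, se_b = 0.18, 0.12   # study B: z = 1.5, "not significant"

    z = diff_z(effect_a, se_a, effect_b, se_b)
    print(f"z for the difference: {z:.2f}, two-sided p: {two_sided_p(z):.2f}")
    # Here z is about 0.14 and p is about 0.89: the two studies are not in
    # conflict, even though one clears the significance threshold and one does not.

Applying the same function to the other situations reproduces the student's "Sig in diff" column; it is that column, not the per-study significance labels, that the paper says should drive the comparison.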


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('situation', 0.515), ('sig', 0.421), ('situations', 0.199), ('study', 0.195), ('studies', 0.191), ('treatment', 0.178), ('statistically', 0.174), ('se', 0.155), ('replications', 0.145), ('effectiveness', 0.134), ('intervals', 0.112), ('unclear', 0.105), ('effect', 0.104), ('effective', 0.103), ('needs', 0.101), ('effects', 0.087), ('student', 0.086), ('thinking', 0.084), ('refers', 0.084), ('stat', 0.084), ('larger', 0.08), ('conclude', 0.079), ('significant', 0.079), ('difference', 0.073), ('smaller', 0.072), ('however', 0.071), ('conceptualize', 0.07), ('reaction', 0.069), ('necessarily', 0.066), ('replied', 0.065), ('diff', 0.063), ('divergent', 0.063), ('variation', 0.063), ('sized', 0.061), ('substantively', 0.061), ('trying', 0.061), ('synthesis', 0.059), ('bite', 0.059), ('paper', 0.058), ('practice', 0.055), ('stern', 0.055), ('bullet', 0.055), ('acknowledging', 0.054), ('lessons', 0.053), ('addressing', 0.052), ('encourages', 0.052), ('literature', 0.052), ('easiest', 0.052), ('deterministic', 0.05), ('across', 0.05)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals

2 0.15839621 1072 andrew gelman stats-2011-12-19-“The difference between . . .”: It’s not just p=.05 vs. p=.06

Introduction: The title of this post by Sanjay Srivastava illustrates an annoying misconception that’s crept into the (otherwise delightful) recent publicity related to my article with Hal Stern, The difference between “significant” and “not significant” is not itself statistically significant. When people bring this up, they keep referring to the difference between p=0.05 and p=0.06, making the familiar (and correct) point about the arbitrariness of the conventional p-value threshold of 0.05. And, sure, I agree with this, but everybody knows that already. The point Hal and I were making was that even apparently large differences in p-values are not statistically significant. For example, if you have one study with z=2.5 (almost significant at the 1% level!) and another with z=1 (not statistically significant at all, only 1 se from zero!), then their difference has a z of about 1 (again, not statistically significant at all). So it’s not just a comparison of 0.05 vs. 0.06, even a differenc
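
The "z of about 1" in that example follows directly from the standard error of a difference of independent estimates; a quick check, assuming (as the example implies) that both estimates have standard error 1, so the estimates equal their z-scores:

    import math

    z1, z2 = 2.5, 1.0
    se_diff = math.sqrt(1**2 + 1**2)   # se of the difference of two independent unit-se estimates
    print((z1 - z2) / se_diff)         # about 1.06: the difference is itself only ~1 se from zero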

3 0.14794196 1891 andrew gelman stats-2013-06-09-“Heterogeneity of variance in experimental studies: A challenge to conventional interpretations”

Introduction: Avi sent along this old paper from Bryk and Raudenbush, who write: The presence of heterogeneity of variance across groups indicates that the standard statistical model for treatment effects no longer applies. Specifically, the assumption that treatments add a constant to each subject’s development fails. An alternative model is required to represent how treatment effects are distributed across individuals. We develop in this article a simple statistical model to demonstrate the link between heterogeneity of variance and random treatment effects. Next, we illustrate with results from two previously published studies how a failure to recognize the substantive importance of heterogeneity of variance obscured significant results present in these data. The article concludes with a review and synthesis of techniques for modeling variances. Although these methods have been well established in the statistical literature, they are not widely known by social and behavioral scientists. T
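
A small simulation, not taken from Bryk and Raudenbush but just sketching the mechanism their summary describes, shows how subject-level variation in the treatment effect shows up as extra variance in the treatment arm:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma, tau, mean_effect = 100_000, 1.0, 0.7, 0.5   # assumed values for illustration

    control = rng.normal(0.0, sigma, n)
    # each treated subject gets an individual effect drawn from N(mean_effect, tau^2)
    treated = rng.normal(0.0, sigma, n) + rng.normal(mean_effect, tau, n)

    print(control.var(), treated.var())   # roughly sigma^2 versus sigma^2 + tau^2

Under a constant additive treatment effect the two arm variances would match; the inflated treatment-arm variance is the signature of a random (varying) effect.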

4 0.1380899 2042 andrew gelman stats-2013-09-28-Difficulties of using statistical significance (or lack thereof) to sift through and compare research hypotheses

Introduction: Dean Eckles writes: Thought you might be interested in an example that touches on a couple recurring topics: 1. The difference between a statistically significant finding and one that is non-significant need not be itself statistically significant (thus highlighting the problems of using NHST to declare whether an effect exists or not). 2. Continued issues with the credibility of high profile studies of “social contagion”, especially by Christakis and Fowler. A new paper in Archives of Sexual Behavior produces observational estimates of peer effects in sexual behavior and same-sex attraction. In the text, the authors (who include C&F) make repeated comparisons of the results for peer effects in sexual intercourse and those for peer effects in same-sex attraction. However, the 95% CI for the latter actually includes the point estimate for the former! This is most clear in Figure 2, as highlighted by Real Clear Science’s blog post about the study. (Now because there is som

5 0.12903504 1310 andrew gelman stats-2012-05-09-Varying treatment effects, again

Introduction: This time from Bernard Fraga and Eitan Hersh. Once you think about it, it’s hard to imagine any nonzero treatment effects that don’t vary. I’m glad to see this area of research becoming more prominent. ( Here ‘s a discussion of another political science example, also of voter turnout, from a few years ago, from Avi Feller and Chris Holmes.) Some of my fragmentary work on varying treatment effects is here (Treatment Effects in Before-After Data) and here (Estimating Incumbency Advantage and Its Variation, as an Example of a Before–After Study).

6 0.12818775 7 andrew gelman stats-2010-04-27-Should Mister P be allowed-encouraged to reside in counter-factual populations?

7 0.12434614 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

8 0.12355409 753 andrew gelman stats-2011-06-09-Allowing interaction terms to vary

9 0.12057535 936 andrew gelman stats-2011-10-02-Covariate Adjustment in RCT - Model Overfitting in Multilevel Regression

10 0.12005193 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

11 0.1178161 1209 andrew gelman stats-2012-03-12-As a Bayesian I want scientists to report their data non-Bayesianly

12 0.11776681 1756 andrew gelman stats-2013-03-10-He said he was sorry

13 0.11600684 1523 andrew gelman stats-2012-10-06-Comparing people from two surveys, one of which is a simple random sample and one of which is not

14 0.11558171 899 andrew gelman stats-2011-09-10-The statistical significance filter

15 0.1150852 2210 andrew gelman stats-2014-02-13-Stopping rules and Bayesian analysis

16 0.11361005 1355 andrew gelman stats-2012-05-31-Lindley’s paradox

17 0.10672135 803 andrew gelman stats-2011-07-14-Subtleties with measurement-error models for the evaluation of wacky claims

18 0.10646442 1955 andrew gelman stats-2013-07-25-Bayes-respecting experimental design and other things

19 0.10454377 1206 andrew gelman stats-2012-03-10-95% intervals that I don’t believe, because they’re from a flat prior I don’t believe

20 0.10446102 1883 andrew gelman stats-2013-06-04-Interrogating p-values


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.181), (1, 0.047), (2, 0.045), (3, -0.203), (4, -0.016), (5, -0.064), (6, 0.008), (7, 0.043), (8, -0.015), (9, -0.019), (10, -0.056), (11, 0.03), (12, 0.078), (13, -0.092), (14, 0.056), (15, 0.004), (16, -0.011), (17, 0.002), (18, -0.017), (19, 0.034), (20, -0.0), (21, 0.029), (22, 0.004), (23, -0.021), (24, 0.027), (25, -0.002), (26, -0.004), (27, -0.029), (28, -0.072), (29, -0.0), (30, -0.043), (31, -0.046), (32, -0.018), (33, 0.034), (34, 0.022), (35, 0.067), (36, -0.052), (37, -0.009), (38, 0.011), (39, 0.007), (40, 0.029), (41, -0.012), (42, 0.022), (43, 0.037), (44, 0.07), (45, -0.026), (46, -0.003), (47, -0.024), (48, 0.04), (49, 0.025)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98547494 1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals

2 0.78463823 2227 andrew gelman stats-2014-02-27-“What Can we Learn from the Many Labs Replication Project?”

Introduction: Aki points us to this discussion from Rolf Zwaan: The first massive replication project in psychology has just reached completion (several others are to follow). . . . What can we learn from the ManyLabs project? The results here show the effect sizes for the replication efforts (in green and grey) as well as the original studies (in blue). The 99% confidence intervals are for the meta-analysis of the effect size (the green dots); the studies are ordered by effect size. Let’s first consider what we canNOT learn from these data. Of the 13 replication attempts (when the first four are taken together), 11 succeeded and 2 did not (in fact, at some point ManyLabs suggests that a third one, Imagined Contact also doesn’t really replicate). We cannot learn from this that the vast majority of psychological findings will replicate . . . But even if we had an accurate estimate of the percentage of findings that replicate, how useful would that be? Rather than trying to arrive at a mo

3 0.7725057 7 andrew gelman stats-2010-04-27-Should Mister P be allowed-encouraged to reside in counter-factual populations?

Introduction: Let’s say you are repeatedly going to receive unselected sets of well-done RCTs on various, say, medical treatments. One reasonable assumption with all of these treatments is that they are monotonic – either helpful or harmful for all. The treatment effect will (as always) vary for subgroups in the population – these will not be explicitly identified in the studies – but each study very likely will enroll different percentages of the various patient subgroups. Being all randomized studies, these subgroups will be balanced in the treatment versus control arms – but each study will (as always) be estimating a different – but exchangeable – treatment effect (exchangeable due to the ignorance about the subgroup memberships of the enrolled patients). That reasonable assumption – monotonicity – will be to some extent (as always) wrong, but given that it is a risk believed well worth taking – if the average effect in any population is positive (versus negative) the average effect in any other
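
One standard way to formalize "different but exchangeable treatment effects" is a random-effects meta-analysis. A minimal sketch using the DerSimonian-Laird moment estimator (my choice for illustration, not something specified in the post) on hypothetical per-study estimates:

    import numpy as np

    # Hypothetical per-study treatment-effect estimates and standard errors
    y  = np.array([0.30, 0.10, 0.45, 0.05])
    se = np.array([0.10, 0.08, 0.20, 0.15])

    w = 1 / se**2
    y_fixed = np.sum(w * y) / np.sum(w)            # complete-pooling (fixed-effect) estimate
    Q = np.sum(w * (y - y_fixed)**2)               # heterogeneity statistic
    k = len(y)
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

    w_star = 1 / (se**2 + tau2)                    # random-effects weights
    y_re = np.sum(w_star * y) / np.sum(w_star)     # partially pooled average effect
    se_re = np.sqrt(1 / np.sum(w_star))
    print(tau2, y_re, se_re)

The between-study variance tau2 is exactly the "different but exchangeable" part: each study's effect is modeled as a draw around a common mean rather than as a replicate of one fixed effect.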

4 0.77085537 898 andrew gelman stats-2011-09-10-Fourteen magic words: an update

Introduction: In the discussion of the fourteen magic words that can increase voter turnout by over 10 percentage points, questions were raised about the methods used to estimate the experimental effects. I sent these on to Chris Bryan, the author of the study, and he gave the following response: We’re happy to address the questions that have come up. It’s always noteworthy when a precise psychological manipulation like this one generates a large effect on a meaningful outcome. Such findings illustrate the power of the underlying psychological process. I’ve provided the contingency tables for the two turnout experiments below. As indicated in the paper, the data are analyzed using logistic regressions. The change in chi-squared statistic represents the significance of the noun vs. verb condition variable in predicting turnout; that is, the change in the model’s significance when the condition variable is added. This is a standard way to analyze dichotomous outcomes. Four outliers were excl
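
The "change in chi-squared when the condition variable is added" is a likelihood-ratio test for a logistic regression. A sketch on simulated data (hypothetical variable names and effect size, not Bryan's actual data or analysis), assuming statsmodels and scipy are available:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 500
    condition = rng.integers(0, 2, n)                     # 0 = verb wording, 1 = noun wording (assumed coding)
    p_true = 1 / (1 + np.exp(-(-0.5 + 0.6 * condition)))  # assumed true turnout model
    turnout = rng.binomial(1, p_true)

    m0 = sm.Logit(turnout, np.ones((n, 1))).fit(disp=0)                          # intercept-only model
    m1 = sm.Logit(turnout, sm.add_constant(condition.astype(float))).fit(disp=0) # add condition variable

    lr = 2 * (m1.llf - m0.llf)             # change in chi-squared, 1 degree of freedom
    print(lr, stats.chi2.sf(lr, df=1))     # test statistic and its p-value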

5 0.7697252 1662 andrew gelman stats-2013-01-09-The difference between “significant” and “non-significant” is not itself statistically significant

Introduction: Commenter Rahul asked what I thought of this note by Scott Firestone ( link from Tyler Cowen) criticizing a recent discussion by Kevin Drum suggesting that lead exposure causes violent crime. Firestone writes: It turns out there was in fact a prospective study done—but its implications for Drum’s argument are mixed. The study was a cohort study done by researchers at the University of Cincinnati. Between 1979 and 1984, 376 infants were recruited. Their parents consented to have lead levels in their blood tested over time; this was matched with records over subsequent decades of the individuals’ arrest records, and specifically arrest for violent crime. Ultimately, some of these individuals were dropped from the study; by the end, 250 were selected for the results. The researchers found that for each increase of 5 micrograms of lead per deciliter of blood, there was a higher risk for being arrested for a violent crime, but a further look at the numbers shows a more mixe

6 0.76409686 1910 andrew gelman stats-2013-06-22-Struggles over the criticism of the “cannabis users and IQ change” paper

7 0.75830603 2042 andrew gelman stats-2013-09-28-Difficulties of using statistical significance (or lack thereof) to sift through and compare research hypotheses

8 0.75657803 963 andrew gelman stats-2011-10-18-Question on Type M errors

9 0.73862863 2223 andrew gelman stats-2014-02-24-“Edlin’s rule” for routinely scaling down published estimates

10 0.73167711 1944 andrew gelman stats-2013-07-18-You’ll get a high Type S error rate if you use classical statistical methods to analyze data from underpowered studies

11 0.7266165 1702 andrew gelman stats-2013-02-01-Don’t let your standard errors drive your research agenda

12 0.72475451 1929 andrew gelman stats-2013-07-07-Stereotype threat!

13 0.72209269 1744 andrew gelman stats-2013-03-01-Why big effects are more important than small effects

14 0.72083497 897 andrew gelman stats-2011-09-09-The difference between significant and not significant…

15 0.71698087 2090 andrew gelman stats-2013-11-05-How much do we trust a new claim that early childhood stimulation raised earnings by 42%?

16 0.71692574 1072 andrew gelman stats-2011-12-19-“The difference between . . .”: It’s not just p=.05 vs. p=.06

17 0.71192193 1776 andrew gelman stats-2013-03-25-The harm done by tests of significance

18 0.71066755 511 andrew gelman stats-2011-01-11-One more time on that ESP study: The problem of overestimates and the shrinkage solution

19 0.70267612 1171 andrew gelman stats-2012-02-16-“False-positive psychology”

20 0.7007395 803 andrew gelman stats-2011-07-14-Subtleties with measurement-error models for the evaluation of wacky claims


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.013), (13, 0.015), (15, 0.021), (16, 0.038), (21, 0.054), (22, 0.013), (24, 0.224), (26, 0.06), (53, 0.027), (65, 0.033), (73, 0.013), (86, 0.019), (88, 0.017), (96, 0.013), (97, 0.012), (99, 0.315)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98358703 1150 andrew gelman stats-2012-02-02-The inevitable problems with statistical significance and 95% intervals

2 0.97497499 1941 andrew gelman stats-2013-07-16-Priors

Introduction: Nick Firoozye writes: While I am absolutely sympathetic to the Bayesian agenda I am often troubled by the requirement of having priors. We must have priors on the parameters of an infinite number of models we have never seen before and I find this troubling. There is a similarly troubling problem in economics of utility theory. Utility is on consumables. To be complete, a consumer must assign utility to all sorts of things they never would have encountered. More recent versions of utility theory instead make consumption goods a portfolio of attributes. Cadillacs are x many units of luxury, y of transport, etc. And we can automatically have personal utilities to all these attributes. I don’t ever see parameters. Some models have few and some have hundreds. Instead, I see data. So I don’t know how to have an opinion on parameters themselves. Rather I think it far more natural to have opinions on the behavior of models. The prior predictive density is a good and sensible notion. Also
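
The prior predictive density the writer points to is straightforward to operationalize: draw parameters from the prior, push them through the likelihood, and look at the data the model expects before seeing anything. A toy sketch for a normal model (my own example, not Firoozye's):

    import numpy as np

    rng = np.random.default_rng(2)
    n_draws, n_obs = 1000, 50

    theta = rng.normal(0, 10, n_draws)                         # prior on the parameter
    y_rep = rng.normal(theta[:, None], 1, (n_draws, n_obs))    # prior predictive data sets

    # An "opinion on the behavior of the model" can be checked directly on y_rep,
    # for example the range of sample means the prior considers plausible:
    print(np.percentile(y_rep.mean(axis=1), [2.5, 50, 97.5]))

If that range looks absurd for the application, the prior on the parameter is doing work you did not intend, even if you never formed an opinion about the parameter itself.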

3 0.97482336 1792 andrew gelman stats-2013-04-07-X on JLP

Introduction: Christian Robert writes on the Jeffreys-Lindley paradox. I have nothing to add to this beyond my recent comments: To me, the Lindley paradox falls apart because of its noninformative prior distribution on the parameter of interest. If you really think there’s a high probability the parameter is nearly exactly zero, I don’t see the point of the model saying that you have no prior information at all on the parameter. In short: my criticism of so-called Bayesian hypothesis testing is that it’s insufficiently Bayesian. To clarify, I’m speaking of all the examples I’ve ever worked on in social and environmental science, where in some settings I can imagine a parameter being very close to zero and in other settings I can imagine a parameter taking on just about any value in a wide range, but where I’ve never seen an example where a parameter could be either right at zero or taking on any possible value. But such examples might occur in areas of application that I haven’t worked on.

4 0.97433656 1208 andrew gelman stats-2012-03-11-Gelman on Hennig on Gelman on Bayes

Introduction: Deborah Mayo pointed me to this discussion by Christian Hennig of my recent article on Induction and Deduction in Bayesian Data Analysis. A couple days ago I responded to comments by Mayo, Stephen Senn, and Larry Wasserman. I will respond to Hennig by pulling out paragraphs from his discussion and then replying. Hennig: for me the terms “frequentist” and “subjective Bayes” point to interpretations of probability, and not to specific methods of inference. The frequentist one refers to the idea that there is an underlying data generating process that repeatedly throws out data and would approximate the assumed distribution if one could only repeat it infinitely often. Hennig makes the good point that, if this is the way you would define “frequentist” (it’s not how I’d define the term myself, but I’ll use Hennig’s definition here), then it makes sense to be a frequentist in some settings but not others. Dice really can be rolled over and over again; a sample survey of 15

5 0.9732464 1465 andrew gelman stats-2012-08-21-D. Buggin

Introduction: Joe Zhao writes: I am trying to fit my data using the scaled inverse Wishart model you mentioned in your book, Data Analysis Using Regression and Multilevel/Hierarchical Models. Instead of using a uniform prior on the scale parameters, I try to use a log-normal distribution prior. However, I found that the individual coefficients don’t shrink much toward a certain value even when a highly informative prior (with extremely low variance) is used. The coefficients are just very close to their least-squares estimates. Is it because of the log-normal prior I’m using, or am I wrong somewhere? My reply: If your priors are concentrated enough at zero variance, then yeah, the posterior estimates of the parameters should be pulled (almost) all the way to zero. If this isn’t happening, you got a problem. So as a start I’d try putting in some really strong priors concentrated at 0 (for example, N(0,.1^2)) and checking that you get a sensible answer. If not, you might well have a bug. You can also try
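
The debugging suggestion at the end is easy to illustrate in a conjugate toy version: a single coefficient with a normal prior and known data variance, which is not the questioner's scaled inverse-Wishart setup but shows what "pulled almost all the way to zero" should look like:

    # Posterior mean for estimate y with standard error se under a normal(0, tau^2) prior:
    # a precision-weighted average of the prior mean (0) and the data estimate.
    def posterior_mean(y, se, tau):
        return (y / se**2) / (1 / se**2 + 1 / tau**2)

    y, se = 2.0, 0.5                           # hypothetical least-squares estimate and its se
    print(posterior_mean(y, se, tau=0.1))      # about 0.08: a N(0, 0.1^2) prior pulls it nearly to zero
    print(posterior_mean(y, se, tau=10.0))     # about 2.0: a weak prior leaves the least-squares value alone

If an equally tight prior in the real model leaves the coefficients at their least-squares values, that points to a bug in the code rather than to the log-normal prior.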

6 0.97298926 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards

7 0.97291291 1240 andrew gelman stats-2012-04-02-Blogads update

8 0.97199202 2029 andrew gelman stats-2013-09-18-Understanding posterior p-values

9 0.97187364 2340 andrew gelman stats-2014-05-20-Thermodynamic Monte Carlo: Michael Betancourt’s new method for simulating from difficult distributions and evaluating normalizing constants

10 0.97164309 2109 andrew gelman stats-2013-11-21-Hidden dangers of noninformative priors

11 0.97077405 1644 andrew gelman stats-2012-12-30-Fixed effects, followed by Bayes shrinkage?

12 0.97041029 1713 andrew gelman stats-2013-02-08-P-values and statistical practice

13 0.97029734 511 andrew gelman stats-2011-01-11-One more time on that ESP study: The problem of overestimates and the shrinkage solution

14 0.97020185 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?

15 0.97009349 1355 andrew gelman stats-2012-05-31-Lindley’s paradox

16 0.96973628 2129 andrew gelman stats-2013-12-10-Cross-validation and Bayesian estimation of tuning parameters

17 0.96960592 970 andrew gelman stats-2011-10-24-Bell Labs

18 0.96958447 502 andrew gelman stats-2011-01-04-Cash in, cash out graph

19 0.96945 2358 andrew gelman stats-2014-06-03-Did you buy laundry detergent on their most recent trip to the store? Also comments on scientific publication and yet another suggestion to do a study that allows within-person comparisons

20 0.96910906 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?