knowledge-graph by maker-knowledge-mining

2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing


meta info for this blog

Source: html

Introduction: Ken Rice writes: In the recent discussion on stopping rules I saw a comment that I wanted to chip in on, but thought it might get a bit lost, in the already long thread. Apologies in advance if I misinterpreted what you wrote, or am trying to tell you things you already know. The comment was: “In Bayesian decision making, there is a utility function and you choose the decision with highest expected utility. Making a decision based on statistical significance does not correspond to any utility function.” … which immediately suggests this little 2010 paper: A Decision-Theoretic Formulation of Fisher’s Approach to Testing, The American Statistician, 64(4), 345-349. It contains utilities that lead to decisions that very closely mimic classical Wald tests, and provides a rationale for why this utility is not totally unconnected from how some scientists think. Some (old) slides discussing it are here. A few notes, on things not in the paper: * I know you don’t like squared-


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Ken Rice writes: In the recent discussion on stopping rules I saw a comment that I wanted to chip in on, but thought it might get a bit lost, in the already long thread. [sent-1, score-0.202]

2 Apologies in advance if I misinterpreted what you wrote, or am trying to tell you things you already know. [sent-2, score-0.145]

3 The comment was: “In Bayesian decision making, there is a utility function and you choose the decision with highest expected utility. [sent-3, score-0.737]

4 Making a decision based on statistical significance does not correspond to any utility function. [sent-4, score-0.52]

5 It contains utilities that lead to decisions that very closely mimic classical Wald tests, and provides a rationale for why this utility is not totally unconnected from how some scientists think. [sent-6, score-0.949]

6 Also, I’m not claiming the utilities given are the *only* way to interpret such decisions. [sent-9, score-0.165]

7 * Even if one doesn’t like either squared-error loss or its close relatives, the framework at least provides a way of saying what classical tests and p-values might mean, in the Bayesian paradigm. [sent-10, score-0.729]

8 That they mean something rather different to Bayes factors & posterior probabilities of the null is surprising to many people, particularly those keen to dismiss all use of p-values. [sent-11, score-0.465]

9 I really wrote the paper because I was fed up with unrealistic point-mass priors being the only Bayesian way to get tests; like you, I work in areas where exactly null associations are really hard to defend. [sent-12, score-0.58]

10 Here’s the abstract of Rice’s 2010 paper: In Fisher’s interpretation of statistical testing, a test is seen as a ‘screening’ procedure; one either reports some scientific findings, or alternatively gives no firm conclusions. [sent-14, score-0.342]

11 These choices differ fundamentally from hypothesis testing, in the style of Neyman and Pearson, which does not consider a non-committal response; tests are developed as choices between two complementary hypotheses, typically labeled ‘null’ and ‘alternative’. [sent-15, score-0.695]

12 The same choices are presented in typical Bayesian tests, where Bayes Factors are used to judge the relative support for a null or alternative model. [sent-16, score-0.471]

13 In contrast to hypothesis testing, these ‘screening’ decisions do not exhibit the Lindley/Jeffreys paradox that divides frequentists and Bayesians. [sent-18, score-0.416]

14 This could represent an important way to look at statistical decision making. [sent-19, score-0.214]
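
Sentences 4 and 7 above make a technical claim: a smooth, squared-error-style utility can produce decisions that look like classical Wald tests. The following is a minimal worked sketch of that idea under simple assumptions (an approximately normal posterior and an illustrative claim-cost constant k); it is in the spirit of Rice (2010) but is not claimed to reproduce that paper’s exact loss functions.

```latex
% Approximate posterior for the effect: \theta \mid \text{data} \sim N(\hat\theta, s^2), with no point mass at 0.
% Action a_1: report the finding, loss (\hat\theta-\theta)^2 + k s^2  (squared error plus a claim cost scaled by the variance).
% Action a_0: report no firm conclusion, loss \theta^2                (acting as if the effect were zero).
\mathbb{E}[\mathrm{loss}(a_1)\mid \text{data}] = s^2 + k s^2, \qquad
\mathbb{E}[\mathrm{loss}(a_0)\mid \text{data}] = \hat\theta^2 + s^2 .
% The Bayes decision reports the finding when its expected loss is smaller:
\text{report a finding} \iff \hat\theta^2 > k s^2 \iff \left|\hat\theta/s\right| > \sqrt{k}.
```

The resulting rule is a Wald test with critical value sqrt(k) (k = 3.84 gives the familiar two-sided 5% threshold), and it involves no point-mass prior at zero, consistent with the remark in sentence 13 that these screening decisions do not exhibit the Lindley/Jeffreys paradox.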


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('tests', 0.306), ('utility', 0.241), ('screening', 0.228), ('null', 0.225), ('decision', 0.214), ('wald', 0.188), ('bayesian', 0.182), ('testing', 0.182), ('loss', 0.167), ('utilities', 0.165), ('rice', 0.161), ('choices', 0.154), ('paper', 0.125), ('fisher', 0.122), ('provides', 0.097), ('decisions', 0.095), ('analogs', 0.094), ('describe', 0.094), ('classical', 0.093), ('alternative', 0.092), ('lead', 0.091), ('factors', 0.091), ('making', 0.091), ('mimic', 0.09), ('relatives', 0.09), ('bayes', 0.088), ('frequentists', 0.087), ('misinterpreted', 0.082), ('extends', 0.082), ('unrealistic', 0.082), ('keen', 0.082), ('divides', 0.082), ('hypothesis', 0.081), ('apologies', 0.079), ('fed', 0.077), ('alternatively', 0.077), ('rationale', 0.077), ('ken', 0.076), ('test', 0.074), ('pearson', 0.072), ('associations', 0.071), ('exhibit', 0.071), ('stopping', 0.071), ('formulation', 0.071), ('neyman', 0.071), ('comment', 0.068), ('dismiss', 0.067), ('either', 0.066), ('correspond', 0.065), ('already', 0.063)]
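
The word weights above, the tfidf-scored summary sentences, and the similar-blogs rankings below are the kind of output a standard tfidf pipeline produces. Here is a minimal sketch of such a pipeline using scikit-learn; it is illustrative only (the post texts are abbreviated stand-ins and the variable names are hypothetical), not the actual maker-knowledge-mining code.

```python
# Minimal tfidf similarity sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Abbreviated stand-ins for blog posts indexed on this page.
posts = {
    "2312": "Ken Rice writes: In the recent discussion on stopping rules ...",
    "2149": "In response to the discussion of X and me of his recent paper, Val Johnson writes ...",
    "1355": "Sam Seaver writes: I happened to be reading an ironic article by Karl Friston ...",
}
ids = list(posts)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts[i] for i in ids)   # rows = posts, columns = vocabulary terms

# Top-weighted words for this post (cf. the wordName/wordTfidf list above).
row = X[ids.index("2312")].toarray().ravel()
vocab = vectorizer.get_feature_names_out()
print(sorted(zip(vocab, row), key=lambda t: -t[1])[:10])

# Pairwise cosine similarities (cf. the simValue column in the similar-blogs lists below).
sims = cosine_similarity(X)
print(dict(zip(ids, sims[ids.index("2312")].round(3))))
```

The summary sentences above can be ranked the same way: score each sentence by the tfidf weights of the words it contains and report the highest-scoring ones.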

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing


2 0.24041389 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards

Introduction: In response to the discussion of X and me of his recent paper, Val Johnson writes: I would like to thank Andrew for forwarding his comments on uniformly most powerful Bayesian tests (UMPBTs) to me and his invitation to respond to them. I think he (and also Christian Robert) raise a number of interesting points concerning this new class of Bayesian tests, but I think that they may have confounded several issues that might more usefully be examined separately. The first issue involves the choice of the Bayesian evidence threshold, gamma, used in rejecting a null hypothesis in favor of an alternative hypothesis. Andrew objects to the higher values of gamma proposed in my recent PNAS article on grounds that too many important scientific effects would be missed if thresholds of 25-50 were routinely used. These evidence thresholds correspond roughly to p-values of 0.005; Andrew suggests that evidence thresholds around 5 should continue to be used (gamma=5 corresponds approximate

3 0.21654645 1869 andrew gelman stats-2013-05-24-In which I side with Neyman over Fisher

Introduction: As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally. Here’s an example that recently came up. Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero. Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible

4 0.19600013 1355 andrew gelman stats-2012-05-31-Lindley’s paradox

Introduction: Sam Seaver writes: I [Seaver] happened to be reading an ironic article by Karl Friston when I learned something new about frequentist vs bayesian, namely Lindley’s paradox, on page 12. The text is as follows: So why are we worried about trivial effects? They are important because the probability that the true effect size is exactly zero is itself zero and could cause us to reject the null hypothesis inappropriately. This is a fallacy of classical inference and is not unrelated to Lindley’s paradox (Lindley 1957). Lindley’s paradox describes a counterintuitive situation in which Bayesian and frequentist approaches to hypothesis testing give opposite results. It occurs when: (i) a result is significant by a frequentist test, indicating sufficient evidence to reject the null hypothesis d=0 and (ii) priors render the posterior probability of d=0 high, indicating strong evidence that the null hypothesis is true. In his original treatment, Lindley (1957) showed that – under a parti

5 0.1938335 256 andrew gelman stats-2010-09-04-Noooooooooooooooooooooooooooooooooooooooooooooooo!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Introduction: Masanao sends this one in, under the heading, “another incident of misunderstood p-value”: Warren Davies, a positive psychology MSc student at UEL, provides the latest in our ongoing series of guest features for students. Warren has just released a Psychology Study Guide, which covers information on statistics, research methods and study skills for psychology students. Despite the myriad rules and procedures of science, some research findings are pure flukes. Perhaps you’re testing a new drug, and by chance alone, a large number of people spontaneously get better. The better your study is conducted, the lower the chance that your result was a fluke – but still, there is always a certain probability that it was. Statistical significance testing gives you an idea of what this probability is. In science we’re always testing hypotheses. We never conduct a study to ‘see what happens’, because there’s always at least one way to make any useless set of data look important. We take

6 0.18970662 291 andrew gelman stats-2010-09-22-Philosophy of Bayes and non-Bayes: A dialogue with Deborah Mayo

7 0.18885492 1200 andrew gelman stats-2012-03-06-Some economists are skeptical about microfoundations

8 0.18271784 1205 andrew gelman stats-2012-03-09-Coming to agreement on philosophy of statistics

9 0.17843574 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

10 0.17400081 2281 andrew gelman stats-2014-04-04-The Notorious N.H.S.T. presents: Mo P-values Mo Problems

11 0.16998163 1510 andrew gelman stats-2012-09-25-Incoherence of Bayesian data analysis

12 0.16921051 2295 andrew gelman stats-2014-04-18-One-tailed or two-tailed?

13 0.16118746 1409 andrew gelman stats-2012-07-08-Is linear regression unethical in that it gives more weight to cases that are far from the average?

14 0.15565158 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models

15 0.14876285 1024 andrew gelman stats-2011-11-23-Of hypothesis tests and Unitarians

16 0.14548384 1572 andrew gelman stats-2012-11-10-I don’t like this cartoon

17 0.14119726 922 andrew gelman stats-2011-09-24-Economists don’t think like accountants—but maybe they should

18 0.13963106 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample

19 0.13918489 2305 andrew gelman stats-2014-04-25-Revised statistical standards for evidence (comments to Val Johnson’s comments on our comments on Val’s comments on p-values)

20 0.13595378 1880 andrew gelman stats-2013-06-02-Flame bait


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.221), (1, 0.122), (2, -0.058), (3, -0.045), (4, -0.136), (5, -0.067), (6, -0.033), (7, 0.079), (8, 0.028), (9, -0.125), (10, -0.094), (11, 0.02), (12, 0.028), (13, -0.071), (14, 0.073), (15, -0.04), (16, 0.003), (17, -0.041), (18, -0.031), (19, -0.021), (20, 0.045), (21, 0.043), (22, 0.013), (23, 0.012), (24, -0.057), (25, -0.09), (26, 0.051), (27, -0.007), (28, 0.029), (29, 0.019), (30, 0.03), (31, -0.013), (32, 0.111), (33, 0.018), (34, -0.037), (35, -0.107), (36, 0.068), (37, -0.031), (38, 0.044), (39, 0.053), (40, -0.083), (41, 0.02), (42, -0.006), (43, -0.0), (44, -0.014), (45, 0.036), (46, 0.027), (47, -0.059), (48, 0.049), (49, 0.003)]
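
The (topicId, topicWeight) pairs above are a reduced-dimension representation of the post in the style of latent semantic indexing (lsi). Below is a minimal sketch of one common way to obtain such weights, via a truncated SVD of a tfidf matrix; the corpus, topic count, and names are illustrative only, not the page's actual pipeline.

```python
# Minimal LSI-style sketch: project tfidf vectors onto latent topic dimensions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "Ken Rice on a decision-theoretic formulation of Fisher's approach to testing",
    "Val Johnson on revised statistical standards and Bayesian evidence thresholds",
    "Lindley's paradox and Bayesian versus frequentist hypothesis testing",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Two topics for this toy corpus; the list above uses roughly 50 topic dimensions.
lsi = TruncatedSVD(n_components=2, random_state=0)
topic_weights = lsi.fit_transform(X)                  # shape: (n_docs, n_topics)

# (topicId, topicWeight) pairs for the first document, analogous to the list above.
print(list(enumerate(topic_weights[0].round(3))))

# Similar posts can then be ranked by cosine similarity in this reduced topic space.
```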

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96167511 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing


2 0.89111418 1355 andrew gelman stats-2012-05-31-Lindley’s paradox


3 0.85537457 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards


4 0.82958502 2281 andrew gelman stats-2014-04-04-The Notorious N.H.S.T. presents: Mo P-values Mo Problems

Introduction: A recent discussion between commenters Question and Fernando captured one of the recurrent themes here from the past year. Question: The problem is simple, the researchers are disproving always false null hypotheses and taking this disproof as near proof that their theory is correct. Fernando: Whereas it is probably true that researchers misuse NHT, the problem with tabloid science is broader and deeper. It is systemic. Question: I do not see how anything can be deeper than replacing careful description, prediction, falsification, and independent replication with dynamite plots, p-values, affirming the consequent, and peer review. From my own experience I am confident in saying that confusion caused by NHST is at the root of this problem. Fernando: Incentives? Impact factors? Publish or die? “Interesting” and “new” above quality and reliability, or actually answering a research question, and a silly and unbecoming obsession with being quoted in NYT, etc. . . . Giv

5 0.80442876 1024 andrew gelman stats-2011-11-23-Of hypothesis tests and Unitarians

Introduction: Xian, Judith, and I read this line in a book by statistician Murray Aitkin in which he considered the following hypothetical example: A survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address . . . The question of interest is whether there has been a change in support between the surveys . . . We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change. Here is our response: Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population. A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more “grown-up” view of the null hypothesis, not that the change

6 0.79429579 2295 andrew gelman stats-2014-04-18-One-tailed or two-tailed?

7 0.76428002 2305 andrew gelman stats-2014-04-25-Revised statistical standards for evidence (comments to Val Johnson’s comments on our comments on Val’s comments on p-values)

8 0.76058084 331 andrew gelman stats-2010-10-10-Bayes jumps the shark

9 0.75405431 2078 andrew gelman stats-2013-10-26-“The Bayesian approach to forensic evidence”

10 0.75252044 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models

11 0.73722553 643 andrew gelman stats-2011-04-02-So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing

12 0.73209524 1869 andrew gelman stats-2013-05-24-In which I side with Neyman over Fisher

13 0.71879172 1095 andrew gelman stats-2012-01-01-Martin and Liu: Probabilistic inference based on consistency of model with data

14 0.71324158 2272 andrew gelman stats-2014-03-29-I agree with this comment

15 0.70671099 256 andrew gelman stats-2010-09-04-Noooooooooooooooooooooooooooooooooooooooooooooooo!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

16 0.69042397 2183 andrew gelman stats-2014-01-23-Discussion on preregistration of research studies

17 0.68977225 506 andrew gelman stats-2011-01-06-That silly ESP paper and some silliness in a rebuttal as well

18 0.68626821 1883 andrew gelman stats-2013-06-04-Interrogating p-values

19 0.685619 1575 andrew gelman stats-2012-11-12-Thinking like a statistician (continuously) rather than like a civilian (discretely)

20 0.6852563 114 andrew gelman stats-2010-06-28-More on Bayesian deduction-induction


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(4, 0.014), (16, 0.043), (21, 0.051), (24, 0.299), (27, 0.018), (31, 0.022), (36, 0.028), (47, 0.022), (77, 0.024), (84, 0.045), (86, 0.038), (94, 0.055), (96, 0.026), (99, 0.22)]
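
The weights above come from a topic model in the style of latent Dirichlet allocation (lda), which assigns each post a distribution over topics. Below is a minimal sketch under the usual setup (LDA fit on raw term counts rather than tfidf weights); again the corpus, topic count, and names are illustrative only.

```python
# Minimal LDA-style sketch: per-post topic proportions from term counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "Ken Rice on a decision-theoretic formulation of Fisher's approach to testing",
    "Val Johnson on revised statistical standards and Bayesian evidence thresholds",
    "Lindley's paradox and Bayesian versus frequentist hypothesis testing",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Two topics for this toy corpus; the list above spreads weight over many more topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)                     # each row sums to 1: topic proportions per post

# (topicId, topicWeight) pairs for the first document, analogous to the list above.
print(list(enumerate(theta[0].round(3))))
```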

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97176862 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing


2 0.96133578 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies

Introduction: From Chris Mulligan: The data come from the Center for Disease Control and cover the years 1969-1988. Chris also gives instructions for how to download the data and plot them in R from scratch (in 30 lines of R code)! And now, the background: A few months ago I heard about a study reporting that, during a recent eleven-year period, more babies were born on Valentine’s Day and fewer on Halloween compared to neighboring days: I wrote, What I’d really like to see is a graph with all 366 days of the year. It would be easy enough to make. That way we could put the Valentine’s and Halloween data in the context of other possible patterns. While they’re at it, they could also graph births by day of the week and show Thanksgiving, Easter, and other holidays that don’t have fixed dates. It’s so frustrating when people only show part of the story. I was pointed to some tables: and a graph from Matt Stiles: The heatmap is cute but I wanted to se

3 0.9598341 2143 andrew gelman stats-2013-12-22-The kluges of today are the textbook solutions of tomorrow.

Introduction: From a response on the Stan help list: Yes, indeed, I think it would be a good idea to reduce the scale on priors of the form U(0,100) or N(0,100^2). This won’t solve all problems but it can’t hurt. If the issue is that the variance parameter can be very small in the estimation, yes, one approach would be to put in a prior that keeps the variance away from 0 (lognormal, gamma, whatever), another approach would be to use the Matt trick. Some mixture of these ideas might help. And, by the way: when you do these things it might feel like an awkward bit of kluging to play around with the model to get it to converge properly. But the kluges of today are the textbook solutions of tomorrow. When it comes to statistical modeling, we’re living in beta-test world; we should appreciate the opportunities this gives us!

4 0.95935786 1706 andrew gelman stats-2013-02-04-Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

Introduction: Justin Kinney writes: Since your blog has discussed the “maximal information coefficient” (MIC) of Reshef et al., I figured you might want to see the critique that Gurinder Atwal and I have posted. In short, Reshef et al.’s central claim that MIC is “equitable” is incorrect. We [Kinney and Atwal] offer mathematical proof that the definition of “equitability” Reshef et al. propose is unsatisfiable—no nontrivial dependence measure, including MIC, has this property. Replicating the simulations in their paper with modestly larger data sets validates this finding. The heuristic notion of equitability, however, can be formalized instead as a self-consistency condition closely related to the Data Processing Inequality. Mutual information satisfies this new definition of equitability but MIC does not. We therefore propose that simply estimating mutual information will, in many cases, provide the sort of dependence measure Reshef et al. seek. For background, here are my two p

5 0.95858753 1455 andrew gelman stats-2012-08-12-Probabilistic screening to get an approximate self-weighted sample

Introduction: Sharad had a survey sampling question: We’re trying to use mechanical turk to conduct some surveys, and have quickly discovered that turkers tend to be quite young. We’d really like a representative sample of the U.S., or at the least be able to recruit a diverse enough sample from turk that we can post-stratify to adjust the estimates. The approach we ended up taking is to pay turkers a small amount to answer a couple of screening questions (age & sex), and then probabilistically recruit individuals to complete the full survey (for more money) based on the estimated turk population parameters and our desired target distribution. We use rejection sampling, so the end result is that individuals who are invited to take the full survey look as if they came from a representative sample, at least in terms of age and sex. I’m wondering whether this sort of technique—a two step design in which participants are first screened and then probabilistically selected to mimic a target distributio

6 0.95849603 482 andrew gelman stats-2010-12-23-Capitalism as a form of voluntarism

7 0.9579978 953 andrew gelman stats-2011-10-11-Steve Jobs’s cancer and science-based medicine

8 0.95743859 1757 andrew gelman stats-2013-03-11-My problem with the Lindley paradox

9 0.95696461 743 andrew gelman stats-2011-06-03-An argument that can’t possibly make sense

10 0.95678669 1092 andrew gelman stats-2011-12-29-More by Berger and me on weakly informative priors

11 0.95674062 1584 andrew gelman stats-2012-11-19-Tradeoffs in information graphics

12 0.95637012 278 andrew gelman stats-2010-09-15-Advice that might make sense for individuals but is negative-sum overall

13 0.9559164 1999 andrew gelman stats-2013-08-27-Bayesian model averaging or fitting a larger model

14 0.95518732 1479 andrew gelman stats-2012-09-01-Mothers and Moms

15 0.95518672 1978 andrew gelman stats-2013-08-12-Fixing the race, ethnicity, and national origin questions on the U.S. Census

16 0.95447803 2231 andrew gelman stats-2014-03-03-Running into a Stan Reference by Accident

17 0.95342314 938 andrew gelman stats-2011-10-03-Comparing prediction errors

18 0.95273626 241 andrew gelman stats-2010-08-29-Ethics and statistics in development research

19 0.9522683 2017 andrew gelman stats-2013-09-11-“Informative g-Priors for Logistic Regression”

20 0.95126307 1891 andrew gelman stats-2013-06-09-“Heterogeneity of variance in experimental studies: A challenge to conventional interpretations”