andrew_gelman_stats andrew_gelman_stats-2014 andrew_gelman_stats-2014-2277 knowledge-graph by maker-knowledge-mining

2277 andrew gelman stats-2014-03-31-The most-cited statistics papers ever


meta info for this blog

Source: html

Introduction: Robert Grant has a list. I’ll just give the ones with more than 10,000 Google Scholar cites:

Cox (1972) Regression and life tables: 35,512 citations.
Dempster, Laird, Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm: 34,988
Bland & Altman (1986) Statistical methods for assessing agreement between two methods of clinical measurement: 27,181
Geman & Geman (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images: 15,106

We can find some more via searching Google Scholar for familiar names and topics; thus:

Metropolis et al. (1953) Equation of state calculations by fast computing machines: 26,000
Benjamini and Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing: 21,000
White (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity: 18,000
Heckman (1977) Sample selection bias as a specification error:


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 But I’m guessing there are some biggies I’m missing. [sent-5, score-0.084]

2 I say this because Grant’s original list included one paper, by Bland and Altman, with over 27,000 cites, that I’d never heard of! [sent-6, score-0.133]

3 I agree with Grant that using Google Scholar favors newer papers. [sent-9, score-0.068]

4 For example, Cooley and Tukey (1965), “An algorithm for the machine calculation of complex Fourier series,” does not make the list, amazingly enough, with only 9300 cites. [sent-10, score-0.167]

5 And the hugely influential book by Snedecor and Cochran has very few cites, I guess cos nobody cites it anymore. [sent-11, score-0.419]

6 And, of course, the most influential researchers such as Laplace, Gauss, Fisher, Neyman, Pearson, etc. [sent-12, score-0.104]

7 If Pearson got a cite for every chi-squared test, Neyman for every rejection region, Fisher for every maximum-likelihood estimate, etc. [sent-14, score-0.234]

8 , their citations would run into the mid to high zillions each. [sent-15, score-0.197]

9 I wrote this post a few months ago so all the citations have gone up. [sent-19, score-0.113]

10 For example, the fuzzy sets paper is now listed at 49,000, and Zadeh has a second paper, “Outline of a new approach to the analysis of complex systems and decision processes,” with 16,000 cites. [sent-20, score-0.385]

11 On the upside, Efron’s 1979 paper, “Bootstrap methods: another look at the jackknife,” has just pulled itself over the 10,000 cites mark. [sent-22, score-0.315]

12 Also, I just checked and Tibshirani’s paper on lasso is at 9873, so in the not too distant future it will make the list too. [sent-24, score-0.221]
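The sentScore column above comes from an extractive tfidf summarizer. A minimal sketch of that kind of scoring, under the usual setup where each sentence is treated as its own document for idf purposes (`rank_sentences` and the exact weighting are illustrative, not the mining pipeline’s actual code):

```python
import math
from collections import Counter

def rank_sentences(sentences):
    """Rank sentences by summed tf-idf weight, treating each
    sentence as its own 'document' when computing idf."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # document frequency: in how many sentences does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))

    def score(toks):
        tf = Counter(toks)
        return sum((c / len(toks)) * math.log(n / df[w])
                   for w, c in tf.items())

    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
```

Sentences dominated by rare, topical words score high; sentences made of words shared by every sentence score near zero.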


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('cites', 0.315), ('altman', 0.186), ('breiman', 0.186), ('zadeh', 0.186), ('heteroskedasticity', 0.159), ('geman', 0.159), ('scholar', 0.159), ('grant', 0.159), ('bland', 0.152), ('fuzzy', 0.136), ('list', 0.133), ('pearson', 0.123), ('specification', 0.121), ('neyman', 0.119), ('google', 0.114), ('citations', 0.113), ('covariance', 0.112), ('influential', 0.104), ('fisher', 0.104), ('matrix', 0.1), ('methods', 0.098), ('maximum', 0.096), ('rubin', 0.089), ('algorithm', 0.089), ('paper', 0.088), ('laird', 0.084), ('jackknife', 0.084), ('mid', 0.084), ('mediator', 0.084), ('biggies', 0.084), ('forests', 0.084), ('hausman', 0.084), ('sets', 0.083), ('granger', 0.08), ('gauss', 0.08), ('restoration', 0.08), ('every', 0.078), ('complex', 0.078), ('rosenbaum', 0.076), ('autoregressive', 0.076), ('relaxation', 0.076), ('moderator', 0.076), ('via', 0.074), ('likelihood', 0.074), ('dempster', 0.074), ('snedecor', 0.074), ('fourier', 0.074), ('series', 0.071), ('cochran', 0.07), ('newer', 0.068)]
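The simValue numbers in the lists that follow are consistent with cosine similarity between sparse word-weight vectors like the one printed above. A hedged sketch of that comparison (the dict-of-weights representation and the toy vectors are assumptions for illustration, not the pipeline’s actual data structures):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two sparse word-weight vectors
    stored as {word: weight} dicts; missing words count as 0."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A vector compared with itself scores 1.0 (the same-blog rows below), and blogs with no vocabulary overlap score 0.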

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 2277 andrew gelman stats-2014-03-31-The most-cited statistics papers ever


2 0.18123402 109 andrew gelman stats-2010-06-25-Classics of statistics

Introduction: Christian Robert is planning a graduate seminar in which students read 15 classic articles of statistics. (See here for more details and a slightly different list.) Actually, he just writes “classics,” but based on his list, I assume he only wants articles, not books. If he wanted to include classic books, I’d nominate the following, just for starters: - Fisher’s Statistical Methods for Research Workers - Snedecor and Cochran’s Statistical Methods - Kish’s Survey Sampling - Box, Hunter, and Hunter’s Statistics for Experimenters - Tukey’s Exploratory Data Analysis - Cleveland’s The Elements of Graphing Data - Mosteller and Wallace’s book on the Federalist Papers. Probably Cox and Hinkley, too. That’s a book that I don’t think has aged well, but it seems to have had a big influence. I think there’s a lot more good and accessible material in these classic books than in the equivalent volume of classic articles. Journal articles can be difficult to read and are typicall

3 0.13582365 1880 andrew gelman stats-2013-06-02-Flame bait

Introduction: Mark Palko asks what I think of this article by Francisco Louca, who writes about “‘hybridization’, a synthesis between Fisherian and Neyman-Pearsonian precepts, defined as a number of practical proceedings for statistical testing and inference that were developed notwithstanding the original authors, as an eventual convergence between what they considered to be radically irreconcilable.” To me, the statistical ideas in this paper are too old-fashioned. The issue is not that the Neyman-Pearson and Fisher approaches are “irreconcilable” but rather that neither does the job in the sort of hard problems that face statistical science today. I’m thinking of technically difficult models such as hierarchical Gaussian processes and also challenges that arise with small sample size and multiple testing. Neyman, Pearson, and Fisher all were brilliant, and they all developed statistical methods that remain useful today, but I think their foundations are out of date. Yes, we currently use m

4 0.12932058 1869 andrew gelman stats-2013-05-24-In which I side with Neyman over Fisher

Introduction: As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally. Here’s an example that recently came up. Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero. Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible

5 0.12436163 1047 andrew gelman stats-2011-12-08-I Am Too Absolutely Heteroskedastic for This Probit Model

Introduction: Soren Lorensen wrote: I’m working on a project that uses a binary choice model on panel data. Since I have panel data and am using MLE, I’m concerned about heteroskedasticity making my estimates inconsistent and biased. Are you familiar with any statistical packages with pre-built tests for heteroskedasticity in binary choice ML models? If not, is there value in cutting my data into groups over which I guess the error variance might vary and eyeballing residual plots? Have you other suggestions about how I might resolve this concern? I replied that I wouldn’t worry so much about heteroskedasticity. Breaking up the data into pieces might make sense, but for the purpose of estimating how the coefficients might vary—that is, nonlinearity and interactions. Soren shot back: I’m somewhat puzzled however: homoskedasticity is an identifying assumption in estimating a probit model: if we don’t have it all sorts of bad things can happen to our parameter estimates. Do you suggest n

6 0.12291903 1309 andrew gelman stats-2012-05-09-The first version of my “inference from iterative simulation using parallel sequences” paper!

7 0.1145271 1769 andrew gelman stats-2013-03-18-Tibshirani announces new research result: A significance test for the lasso

8 0.10460306 774 andrew gelman stats-2011-06-20-The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing

9 0.10262702 1962 andrew gelman stats-2013-07-30-The Roy causal model?

10 0.099896237 1477 andrew gelman stats-2012-08-30-Visualizing Distributions of Covariance Matrices

11 0.099137001 1560 andrew gelman stats-2012-11-03-Statistical methods that work in some settings but not others

12 0.096469872 1991 andrew gelman stats-2013-08-21-BDA3 table of contents (also a new paper on visualization)

13 0.09590359 1951 andrew gelman stats-2013-07-22-Top 5 stat papers since 2000?

14 0.094968624 738 andrew gelman stats-2011-05-30-Works well versus well understood

15 0.092171527 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

16 0.090048075 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

17 0.086087346 1763 andrew gelman stats-2013-03-14-Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

18 0.084993131 2170 andrew gelman stats-2014-01-13-Judea Pearl overview on causal inference, and more general thoughts on the reexpression of existing methods by considering their implicit assumptions

19 0.083542183 1690 andrew gelman stats-2013-01-23-When are complicated models helpful in psychology research and when are they overkill?

20 0.083468191 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.163), (1, 0.073), (2, -0.027), (3, -0.038), (4, 0.01), (5, 0.013), (6, -0.048), (7, -0.047), (8, 0.018), (9, -0.005), (10, -0.008), (11, 0.005), (12, 0.001), (13, -0.021), (14, 0.04), (15, 0.004), (16, -0.007), (17, -0.008), (18, -0.024), (19, -0.043), (20, 0.014), (21, -0.007), (22, 0.076), (23, 0.028), (24, 0.039), (25, 0.028), (26, -0.033), (27, 0.044), (28, 0.057), (29, 0.032), (30, 0.012), (31, 0.019), (32, 0.041), (33, 0.028), (34, 0.016), (35, -0.028), (36, -0.024), (37, -0.008), (38, -0.008), (39, 0.023), (40, -0.07), (41, 0.047), (42, 0.035), (43, 0.015), (44, -0.008), (45, 0.035), (46, -0.022), (47, -0.047), (48, 0.024), (49, -0.079)]
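The 50 topic weights above are the kind of low-dimensional document coordinates a latent semantic indexing model produces. A minimal sketch, assuming the standard construction (truncated SVD of a term-document matrix; `lsi_doc_coords` is an illustrative name, not the pipeline’s API):

```python
import numpy as np

def lsi_doc_coords(term_doc, k):
    """LSI: truncated SVD of a (terms x docs) count or tfidf matrix.
    Returns each document's coordinates in the k-dim latent topic space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # rows of Vt index latent topics, columns index documents;
    # scale by singular values to get document coordinates
    return Vt[:k].T * s[:k]  # shape: (n_docs, k)
```

Similar blogs are then the rows with the highest cosine similarity in this k-dimensional space, which is plausibly what the simValue column below reports.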

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96543002 2277 andrew gelman stats-2014-03-31-The most-cited statistics papers ever


2 0.77952152 1880 andrew gelman stats-2013-06-02-Flame bait

Introduction: Mark Palko asks what I think of this article by Francisco Louca, who writes about “‘hybridization’, a synthesis between Fisherian and Neyman-Pearsonian precepts, defined as a number of practical proceedings for statistical testing and inference that were developed notwithstanding the original authors, as an eventual convergence between what they considered to be radically irreconcilable.” To me, the statistical ideas in this paper are too old-fashioned. The issue is not that the Neyman-Pearson and Fisher approaches are “irreconcilable” but rather that neither does the job in the sort of hard problems that face statistical science today. I’m thinking of technically difficult models such as hierarchical Gaussian processes and also challenges that arise with small sample size and multiple testing. Neyman, Pearson, and Fisher all were brilliant, and they all developed statistical methods that remain useful today, but I think their foundations are out of date. Yes, we currently use m

3 0.72068709 1309 andrew gelman stats-2012-05-09-The first version of my “inference from iterative simulation using parallel sequences” paper!

Introduction: From August 1990. It was in the form of a note sent to all the people in the statistics group of Bell Labs, where I’d worked that summer. To all: Here’s the abstract of the work I’ve done this summer. It’s stored in the file, /fs5/gelman/abstract.bell, and copies of the Figures 1-3 are on Trevor’s desk. Any comments are of course appreciated; I’m at gelman@stat.berkeley.edu. On the Routine Use of Markov Chains for Simulation Andrew Gelman and Donald Rubin, 6 August 1990 corrected version: 8 August 1990 1. Simulation In probability and statistics we can often specify multivariate distributions many of whose properties we do not fully understand–perhaps, as in the Ising model of statistical physics, we can write the joint density function, up to a multiplicative constant that cannot be expressed in closed form. For an example in statistics, consider the Normal random effects model in the analysis of variance, which can be easily placed in a Bayesian fram

4 0.68440181 1443 andrew gelman stats-2012-08-04-Bayesian Learning via Stochastic Gradient Langevin Dynamics

Introduction: Burak Bayramli writes: In this paper by Sunjin Ahn, Anoop Korattikara, and Max Welling and this paper by Welling and Yee Whye The, there are some arguments on big data and the use of MCMC. Both papers have suggested improvements to speed up MCMC computations. I was wondering what your thoughts were, especially on this paragraph: When a dataset has a billion data-cases (as is not uncommon these days) MCMC algorithms will not even have generated a single (burn-in) sample when a clever learning algorithm based on stochastic gradients may already be making fairly good predictions. In fact, the intriguing results of Bottou and Bousquet (2008) seem to indicate that in terms of “number of bits learned per unit of computation”, an algorithm as simple as stochastic gradient descent is almost optimally efficient. We therefore argue that for Bayesian methods to remain useful in an age when the datasets grow at an exponential rate, they need to embrace the ideas of the stochastic optimiz

5 0.66911274 1991 andrew gelman stats-2013-08-21-BDA3 table of contents (also a new paper on visualization)

Introduction: In response to our recent posting of Amazon’s offer of Bayesian Data Analysis 3rd edition at 40% off, some people asked what was in this new edition, with more information beyond the beautiful cover image and the brief paragraph I’d posted earlier. Here’s the table of contents. The following sections have all-new material: 1.4 New introduction of BDA principles using a simple spell checking example 2.9 Weakly informative prior distributions 5.7 Weakly informative priors for hierarchical variance parameters 7.1-7.4 Predictive accuracy for model evaluation and comparison 10.6 Computing environments 11.4 Split R-hat 11.5 New measure of effective number of simulation draws 13.7 Variational inference 13.8 Expectation propagation 13.9 Other approximations 14.6 Regularization for regression models C.1 Getting started with R and Stan C.2 Fitting a hierarchical model in Stan C.4 Programming Hamiltonian Monte Carlo in R And the new chapters: 20 Basis function models 2

6 0.66799694 778 andrew gelman stats-2011-06-24-New ideas on DIC from Martyn Plummer and Sumio Watanabe

7 0.66771221 1477 andrew gelman stats-2012-08-30-Visualizing Distributions of Covariance Matrices

8 0.64866263 674 andrew gelman stats-2011-04-21-Handbook of Markov Chain Monte Carlo

9 0.64432567 1339 andrew gelman stats-2012-05-23-Learning Differential Geometry for Hamiltonian Monte Carlo

10 0.64129573 1095 andrew gelman stats-2012-01-01-Martin and Liu: Probabilistic inference based on consistency of model with data

11 0.63967508 744 andrew gelman stats-2011-06-03-Statistical methods for healthcare regulation: rating, screening and surveillance

12 0.63519317 1739 andrew gelman stats-2013-02-26-An AI can build and try out statistical models using an open-ended generative grammar

13 0.63515854 501 andrew gelman stats-2011-01-04-A new R package for fititng multilevel models

14 0.63268232 555 andrew gelman stats-2011-02-04-Handy Matrix Cheat Sheet, with Gradients

15 0.62947708 32 andrew gelman stats-2010-05-14-Causal inference in economics

16 0.62753248 2170 andrew gelman stats-2014-01-13-Judea Pearl overview on causal inference, and more general thoughts on the reexpression of existing methods by considering their implicit assumptions

17 0.62206423 2258 andrew gelman stats-2014-03-21-Random matrices in the news

18 0.61643434 1270 andrew gelman stats-2012-04-19-Demystifying Blup

19 0.61611199 2117 andrew gelman stats-2013-11-29-The gradual transition to replicable science

20 0.61108214 595 andrew gelman stats-2011-02-28-What Zombies see in Scatterplots


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(15, 0.117), (16, 0.064), (24, 0.116), (27, 0.057), (30, 0.053), (31, 0.032), (35, 0.015), (43, 0.018), (53, 0.02), (55, 0.038), (57, 0.023), (68, 0.027), (69, 0.019), (72, 0.016), (75, 0.017), (83, 0.012), (86, 0.07), (99, 0.18)]
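The 18 nonzero topic weights above are a document-topic distribution from an LDA model. As a hedged illustration of where such weights come from, here is a toy collapsed Gibbs sampler (not the mining pipeline’s implementation; `lda_gibbs` and its hyperparameters are assumptions for the sketch):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: lists of integer word ids.
    Returns one normalized document-topic weight vector per document."""
    rng = random.Random(seed)
    n_words = len({w for d in docs for w in d})
    # random initial topic for every token, plus the three count tables
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]  # remove this token's current assignment
                doc_topic[di][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # resample its topic from the collapsed conditional
                weights = [(doc_topic[di][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + n_words * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = t
                doc_topic[di][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # smoothed, normalized document-topic weights, as in the list above
    return [[(c + alpha) / (len(d) + n_topics * alpha) for c in row]
            for row, d in zip(doc_topic, docs)]
```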

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94524819 2277 andrew gelman stats-2014-03-31-The most-cited statistics papers ever


2 0.87758338 945 andrew gelman stats-2011-10-06-W’man < W’pedia, again

Introduction: Blogger Deep Climate looks at another paper by the 2002 recipient of the American Statistical Association’s Founders award. This time it’s not funny, it’s just sad. Here’s Wikipedia on simulated annealing: By analogy with this physical process, each step of the SA algorithm replaces the current solution by a random “nearby” solution, chosen with a probability that depends on the difference between the corresponding function values and on a global parameter T (called the temperature), that is gradually decreased during the process. The dependency is such that the current solution changes almost randomly when T is large, but increasingly “downhill” as T goes to zero. The allowance for “uphill” moves saves the method from becoming stuck at local minima—which are the bane of greedier methods. And here’s Wegman: During each step of the algorithm, the variable that will eventually represent the minimum is replaced by a random solution that is chosen according to a temperature
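The simulated annealing recipe quoted above (random nearby proposal, temperature-controlled acceptance of uphill moves, gradual cooling) can be sketched directly; the objective, step size, and cooling schedule here are illustrative choices, not anything from the papers under discussion:

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.995,
                        n_iter=2000, seed=0):
    """Minimize f by SA: propose a random nearby point, always accept
    downhill moves, accept uphill moves with prob exp(-delta/T),
    and gradually decrease the temperature T."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = t0
    for _ in range(n_iter):
        y = x + rng.uniform(-step, step)   # random "nearby" solution
        fy = f(y)
        if fy < fx or rng.random() < math.exp(-(fy - fx) / T):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        T *= cooling                        # cool gradually
    return best, fbest
```

The exp(-delta/T) acceptance rule is exactly the “allowance for uphill moves” that keeps the search from getting stuck at local minima.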

3 0.87729281 133 andrew gelman stats-2010-07-08-Gratuitous use of “Bayesian Statistics,” a branding issue?

Introduction: I’m on an island in Maine for a few weeks (big shout out for North Haven!) This morning I picked up a copy of “Working Waterfront,” a newspaper that focuses on issues of coastal fishing communities. I came across an article about modeling “fish” populations — actually lobsters, I guess they’re considered “fish” for regulatory purposes. When I read it, I thought “wow, this article is really well-written, not dumbed down like articles in most newspapers.” I think it’s great that a small coastal newspaper carries reporting like this. (The online version has a few things that I don’t recall in the print version, too, so it’s even better). But in addition to being struck by finding such a good article in a small newspaper, I was struck by this: According to [University of Maine scientist Yong] Chen, there are four main areas where his model improved on the prior version. “We included the inshore trawl data from Maine and other state surveys, in addition to federal survey data; we h

4 0.87250614 1081 andrew gelman stats-2011-12-24-Statistical ethics violation

Introduction: A colleague writes: When I was in NYC I went to this party by group of Japanese bio-scientists. There, one guy told me about how the biggest pharmaceutical company in Japan did their statistics. They ran 100 different tests and reported the most significant one. (This was in 2006 and he said they stopped doing this few years back so they were doing this until pretty recently…) I’m not sure if this was 100 multiple comparison or 100 different kinds of test but I’m sure they wouldn’t want to disclose their data… Ouch!

5 0.86873531 1541 andrew gelman stats-2012-10-19-Statistical discrimination again

Introduction: Mark Johnstone writes: I’ve recently been investigating a new European Court of Justice ruling on insurance calculations (on behalf of MoneySuperMarket) and I found something related to statistics that caught my attention. . . . The ruling (which comes into effect in December 2012) states that insurers in Europe can no longer provide different premiums based on gender. Despite the fact that women are statistically safer drivers, unless it’s biologically proven there is a causal relationship between being female and being a safer driver, this is now seen as an act of discrimination (more on this from the Wall Street Journal). However, where do you stop with this? What about age? What about other factors? And what does this mean for the application of statistics in general? Is it inherently unjust in this context? One proposal has been to fit ‘black boxes’ into cars so more individual data can be collected, as opposed to relying heavily on aggregates. For fans of data and s

6 0.86507833 1800 andrew gelman stats-2013-04-12-Too tired to mock

7 0.86167127 2177 andrew gelman stats-2014-01-19-“The British amateur who debunked the mathematics of happiness”

8 0.86165667 329 andrew gelman stats-2010-10-08-More on those dudes who will pay your professor $8000 to assign a book to your class, and related stories about small-time sleazoids

9 0.85786963 576 andrew gelman stats-2011-02-15-With a bit of precognition, you’d have known I was going to post again on this topic, and with a lot of precognition, you’d have known I was going to post today

10 0.85552621 1779 andrew gelman stats-2013-03-27-“Two Dogmas of Strong Objective Bayesianism”

11 0.8540951 1908 andrew gelman stats-2013-06-21-Interpreting interactions in discrete-data regression

12 0.85377896 834 andrew gelman stats-2011-08-01-I owe it all to the haters

13 0.8527689 994 andrew gelman stats-2011-11-06-Josh Tenenbaum presents . . . a model of folk physics!

14 0.85189617 1794 andrew gelman stats-2013-04-09-My talks in DC and Baltimore this week

15 0.85048634 902 andrew gelman stats-2011-09-12-The importance of style in academic writing

16 0.84853745 2353 andrew gelman stats-2014-05-30-I posted this as a comment on a sociology blog

17 0.84664953 1435 andrew gelman stats-2012-07-30-Retracted articles and unethical behavior in economics journals?

18 0.84401369 120 andrew gelman stats-2010-06-30-You can’t put Pandora back in the box

19 0.84396023 1774 andrew gelman stats-2013-03-22-Likelihood Ratio ≠ 1 Journal

20 0.84385252 1865 andrew gelman stats-2013-05-20-What happened that the journal Psychological Science published a paper with no identifiable strengths?