andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1725 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Jordan Ellenberg writes: Lots of people sharing this today. Isn’t this exactly the kind of situation where they should have done some kind of shrinkage towards the national mean, as in that thing you wrote about kidney cancer rates by county? I.e., you see, just as you might expect, the extreme values of “proportion of people who said they were gay” are disproportionately taken by small states. My reply: If I don’t have the individual-level survey data that would allow me to do full-scale Mister P, yes, I’d fit a multilevel model to the state-level averages. I wouldn’t quite just partially pool toward the national mean; I think it would make sense to include some state-level predictors. In any case, I think it’s tacky to report poll numbers to fractional percentage points. That kind of precision simply isn’t there. P.S. More discussion of variances of large and small states in the comments.
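The shrinkage idea in the exchange above can be illustrated with a rough sketch. This is not the multilevel model the reply describes: it has no state-level predictors, and it swaps in a simple method-of-moments estimate of the between-state variance. The function name `partial_pool` and the numbers in the usage example are hypothetical, invented only to show the mechanics.

```python
def partial_pool(estimates, sample_sizes):
    """Shrink noisy group proportions toward the sample-size-weighted mean.

    Illustrative sketch only: a method-of-moments stand-in for a fitted
    multilevel model, with no group-level predictors.
    """
    total_n = sum(sample_sizes)
    # Pooled proportion across all groups (the "national mean" in this sketch).
    pooled = sum(p * n for p, n in zip(estimates, sample_sizes)) / total_n

    # Approximate sampling variance of each group's proportion. The pooled
    # proportion is used so that small groups do not get unstable variances.
    sampling_var = [pooled * (1 - pooled) / n for n in sample_sizes]

    # Between-group variance: observed spread minus average sampling noise,
    # floored at zero.
    k = len(estimates)
    mean_est = sum(estimates) / k
    total_var = sum((p - mean_est) ** 2 for p in estimates) / (k - 1)
    between_var = max(total_var - sum(sampling_var) / k, 0.0)

    shrunk = []
    for p, v in zip(estimates, sampling_var):
        # weight near 1: trust the group's own estimate; near 0: pool fully.
        weight = between_var / (between_var + v)
        shrunk.append(weight * p + (1 - weight) * pooled)
    return shrunk


# Made-up proportions and sample sizes for four hypothetical states.
raw = [0.10, 0.017, 0.035, 0.04]
sizes = [200, 300, 5000, 8000]
pooled_estimates = partial_pool(raw, sizes)
```

In this sketch the small-sample groups are pulled furthest toward the pooled mean, which is why extreme raw values from small states deserve the most skepticism. Adding state-level predictors, as the reply suggests, would amount to shrinking each state toward a regression prediction rather than toward a single overall mean.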
sentIndex sentText sentNum sentScore
1 Jordan Ellenberg writes: Lots of people sharing this today. [sent-1, score-0.155]
2 Isn’t this exactly the kind of situation where they should have done some kind of shrinkage towards the national mean, as in that thing you wrote about kidney cancer rates by county? [sent-2, score-1.648]
3 you see, just as you might expect, the extreme values of “proportion of people who said they were gay” are disproportionately taken by small states. [sent-5, score-0.728]
4 My reply: If I don’t have the individual-level survey data that would allow me to do full-scale Mister P, yes, I’d fit a multilevel model to the state-level averages. [sent-6, score-0.388]
5 I wouldn’t quite just partially pool toward the national mean; I think it would make sense to include some state-level predictors. [sent-7, score-0.807]
6 In any case, I think it’s tacky to report poll numbers to fractional percentage points. [sent-8, score-0.832]
7 More discussion of variances of large and small states in the comments. [sent-12, score-0.574]
wordName wordTfidf (topN-words)
[('kind', 0.28), ('ellenberg', 0.235), ('fractional', 0.205), ('national', 0.201), ('tacky', 0.196), ('jordan', 0.186), ('disproportionately', 0.183), ('shrinkage', 0.176), ('mister', 0.174), ('variances', 0.174), ('partially', 0.172), ('county', 0.17), ('isn', 0.168), ('pool', 0.162), ('towards', 0.16), ('sharing', 0.155), ('precision', 0.152), ('gay', 0.149), ('small', 0.148), ('cancer', 0.143), ('mean', 0.142), ('poll', 0.141), ('proportion', 0.133), ('situation', 0.122), ('extreme', 0.121), ('percentage', 0.117), ('allow', 0.116), ('rates', 0.113), ('multilevel', 0.104), ('values', 0.103), ('toward', 0.103), ('taken', 0.102), ('expect', 0.1), ('states', 0.098), ('exactly', 0.097), ('simply', 0.089), ('include', 0.089), ('survey', 0.087), ('report', 0.087), ('wouldn', 0.087), ('numbers', 0.086), ('yes', 0.083), ('fit', 0.081), ('quite', 0.08), ('comments', 0.079), ('done', 0.076), ('large', 0.075), ('lots', 0.073), ('reply', 0.072), ('said', 0.071)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 1725 andrew gelman stats-2013-02-17-“1.7%” ha ha ha
2 0.13008526 1931 andrew gelman stats-2013-07-09-“Frontiers in Massive Data Analysis”
Introduction: Mike Jordan sends along this National Academies report on “big data.” This is not a research report but it could be interesting in that it conveys what are believed to be important technical challenges.
3 0.12735949 21 andrew gelman stats-2010-05-07-Environmentally induced cancer “grossly underestimated”? Doubtful.
Introduction: The (U.S.) “President’s Cancer Panel” has released its 2008-2009 annual report, which includes a cover letter that says “the true burden of environmentally induced cancer has been grossly underestimated.” The report itself discusses exposures to various types of industrial chemicals, some of which are known carcinogens, in some detail, but gives nearly no data or analysis to suggest that these exposures are contributing to significant numbers of cancers. In fact, there is pretty good evidence that they are not. The plot above shows age-adjusted cancer mortality for men, by cancer type, in the U.S. The plot below shows the same for women. In both cases, the cancers with the highest mortality rates are shown, but not all cancers (e.g. brain cancer is not shown). For what it’s worth, I’m not sure how trustworthy the rates are from the 1930s — it seems possible that reporting, autopsies, or both, were less careful during the Great Depression — so I suggest focusing on the r
4 0.12254514 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health
Introduction: When it rains it pours . . . John Transue writes: I saw a post on Andrew Sullivan’s blog today about life expectancy in different US counties. With a bunch of the worst counties being in Mississippi, I thought that it might be another case of analysts getting extreme values from small counties. However, the paper (see here) includes a pretty interesting methods section. This is from page 5, “Specifically, we used a mixed-effects Poisson regression with time, geospatial, and covariate components. Poisson regression fits count outcome variables, e.g., death counts, and is preferable to a logistic model because the latter is biased when an outcome is rare (occurring in less than 1% of observations).” They have downloadable data. I believe that the data are predicted values from the model. A web appendix also gives 90% CIs for their estimates. Do you think they solved the small county problem and that the worst counties really are where their spreadsheet suggests? My re
5 0.1206913 653 andrew gelman stats-2011-04-08-Multilevel regression with shrinkage for “fixed” effects
Introduction: Dean Eckles writes: I remember reading on your blog that you were working on some tools to fit multilevel models that also include “fixed” effects — such as continuous predictors — that are also estimated with shrinkage (for example, an L1 or L2 penalty). Any new developments on this front? I often find myself wanting to fit a multilevel model to some data, but also needing to include a number of “fixed” effects, mainly continuous variables. This makes me wary of overfitting to these predictors, so then I’d want to use some kind of shrinkage. As far as I can tell, the main options for doing this now is by going fully Bayesian and using a Gibbs sampler. With MCMCglmm or BUGS/JAGS I could just specify a prior on the fixed effects that corresponds to a desired penalty. However, this is pretty slow, especially with a large data set and because I’d like to select the penalty parameter by cross-validation (which is where this isn’t very Bayesian I guess?). My reply: We allow info
8 0.096973255 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?
9 0.096728578 1544 andrew gelman stats-2012-10-22-Is it meaningful to talk about a probability of “65.7%” that Obama will win the election?
10 0.096501224 686 andrew gelman stats-2011-04-29-What are the open problems in Bayesian statistics??
11 0.094196968 1315 andrew gelman stats-2012-05-12-Question 2 of my final exam for Design and Analysis of Sample Surveys
12 0.093969561 383 andrew gelman stats-2010-10-31-Analyzing the entire population rather than a sample
13 0.09101595 1732 andrew gelman stats-2013-02-22-Evaluating the impacts of welfare reform?
14 0.089344151 1288 andrew gelman stats-2012-04-29-Clueless Americans think they’ll never get sick
15 0.089222759 1766 andrew gelman stats-2013-03-16-“Nightshifts Linked to Increased Risk for Ovarian Cancer”
16 0.088270761 2180 andrew gelman stats-2014-01-21-Everything I need to know about Bayesian statistics, I learned in eight schools.
17 0.088027559 1989 andrew gelman stats-2013-08-20-Correcting for multiple comparisons in a Bayesian regression model
18 0.087738656 389 andrew gelman stats-2010-11-01-Why it can be rational to vote
19 0.087738656 1565 andrew gelman stats-2012-11-06-Why it can be rational to vote
20 0.086321935 255 andrew gelman stats-2010-09-04-How does multilevel modeling affect the estimate of the grand mean?
topicId topicWeight
[(0, 0.16), (1, 0.014), (2, 0.116), (3, -0.017), (4, 0.039), (5, 0.007), (6, 0.011), (7, -0.009), (8, 0.036), (9, 0.005), (10, 0.043), (11, -0.052), (12, 0.025), (13, 0.038), (14, -0.01), (15, 0.012), (16, -0.008), (17, 0.005), (18, 0.03), (19, 0.006), (20, -0.002), (21, 0.038), (22, -0.051), (23, -0.002), (24, -0.067), (25, -0.038), (26, -0.031), (27, -0.015), (28, -0.032), (29, 0.034), (30, -0.008), (31, 0.027), (32, -0.009), (33, -0.001), (34, -0.003), (35, 0.011), (36, 0.037), (37, 0.001), (38, 0.035), (39, 0.057), (40, -0.019), (41, 0.014), (42, -0.018), (43, -0.007), (44, 0.001), (45, 0.017), (46, 0.007), (47, -0.012), (48, -0.022), (49, 0.005)]
simIndex simValue blogId blogTitle
same-blog 1 0.95344645 1725 andrew gelman stats-2013-02-17-“1.7%” ha ha ha
2 0.73192906 405 andrew gelman stats-2010-11-10-Estimation from an out-of-date census
Introduction: Suguru Mizunoya writes: When we estimate the number of people from a national sampling survey (such as labor force survey) using sampling weights, don’t we obtain underestimated number of people, if the country’s population is growing and the sampling frame is based on an old census data? In countries with increasing populations, the probability of inclusion changes over time, but the weights can’t be adjusted frequently because census takes place only once every five or ten years. I am currently working for UNICEF for a project on estimating number of out-of-school children in developing countries. The project leader is comfortable to use estimates of number of people from DHS and other surveys. But, I am concerned that we may need to adjust the estimated number of people by the population projection, otherwise the estimates will be underestimated. I googled around on this issue, but I could not find a right article or paper on this. My reply: I don’t know if there’s a pa
Introduction: I dodged a bullet the other day, blogorifically speaking. This is a (moderately) long story but there’s a payoff at the end for those of you who are interested in forecasting or understanding voting and public opinion at the state level. Act 1 It started when Jeff Lax made this comment on his recent blog entry: Nebraska Is All That Counts for a Party-Bucking Nelson Dem Senator On Blowback From His Opposition To Kagan: ‘Are They From Nebraska? Then I Don’t Care’ Fine, but 62% of Nebraskans with an opinion favor confirmation… 91% of Democrats, 39% of Republicans, and 61% of Independents. So I guess he only cares about Republican Nebraskans… I conferred with Jeff and then wrote the following entry for fivethirtyeight.com. There was a backlog of posts at 538 at the time, so I set it on delay to appear the following morning. Here’s my post (which I ended up deleting before it ever appeared): Party-Bucking Nelson May Be Nebraska-Bucking as Well Under the head
4 0.71565562 962 andrew gelman stats-2011-10-17-Death!
Introduction: This graph shows the estimate that Kenny Shirley and I have of support for the death penalty by sex and race in the U.S. since 1955: We also found that capital punishment used to be more popular in the Northeast than in the South, but now it’s the other way around. Here’s the abstract to our paper: One of the longest running questions that has been regularly included in Gallup’s national public opinion poll is “Do you favor or oppose the death penalty for persons convicted of murder?” Because the death penalty is governed by state laws rather than federal laws, it is of special interest to know how public opinion varies by state, and how it has changed over time within each state. In this paper we combine dozens of national polls taken over a fifty-year span and fit a Bayesian multilevel logistic regression model to individual response data to estimate changes in state-level public opinion over time. Such a long span of polls has not been analyzed this way before, partly
5 0.70884824 2152 andrew gelman stats-2013-12-28-Using randomized incentives as an instrument for survey nonresponse?
Introduction: I received the following question: Is there a classic paper on instrumenting for survey non-response? some colleagues in public health are going to carry out a survey and I wonder about suggesting that they build in a randomization of response-encouragement (e.g. offering additional $ to a subset of those who don’t respond initially). Can you recommend a basic treatment of this, and why it might or might not make sense compared to IPW using covariates (without an instrument)? My reply: Here’s the best analysis I know of on the effects of incentives for survey response. There have been several survey-experiments on the subject. The short answer is that the effect on nonresponse is small and the outcome is highly variable, hence you can’t very well use it as an instrument in any particular survey. My recommended approach to dealing with nonresponse is to use multilevel regression and poststratification; an example is here. Inverse-probability weighting doesn’t really w
6 0.70731038 1900 andrew gelman stats-2013-06-15-Exploratory multilevel analysis when group-level variables are of importance
8 0.70555747 1294 andrew gelman stats-2012-05-01-Modeling y = a + b + c
9 0.70499939 784 andrew gelman stats-2011-07-01-Weighting and prediction in sample surveys
10 0.70361561 352 andrew gelman stats-2010-10-19-Analysis of survey data: Design based models vs. hierarchical modeling?
12 0.6891107 2167 andrew gelman stats-2014-01-10-Do you believe that “humans and other living things have evolved over time”?
13 0.68296003 12 andrew gelman stats-2010-04-30-More on problems with surveys estimating deaths in war zones
14 0.67708957 454 andrew gelman stats-2010-12-07-Diabetes stops at the state line?
15 0.67568129 1940 andrew gelman stats-2013-07-16-A poll that throws away data???
16 0.6753096 70 andrew gelman stats-2010-06-07-Mister P goes on a date
17 0.67355257 245 andrew gelman stats-2010-08-31-Predicting marathon times
18 0.6725207 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable
20 0.66577876 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health
topicId topicWeight
[(16, 0.056), (24, 0.177), (36, 0.022), (55, 0.019), (66, 0.02), (82, 0.156), (84, 0.022), (86, 0.039), (99, 0.378)]
simIndex simValue blogId blogTitle
1 0.98566294 940 andrew gelman stats-2011-10-03-It depends upon what the meaning of the word “firm” is.
Introduction: David Hogg pointed me to this news article by Angela Saini: It’s not often that the quiet world of mathematics is rocked by a murder case. But last summer saw a trial that sent academics into a tailspin, and has since swollen into a fevered clash between science and the law. At its heart, this is a story about chance. And it begins with a convicted killer, “T”, who took his case to the court of appeal in 2010. Among the evidence against him was a shoeprint from a pair of Nike trainers, which seemed to match a pair found at his home. While appeals often unmask shaky evidence, this was different. This time, a mathematical formula was thrown out of court. The footwear expert made what the judge believed were poor calculations about the likelihood of the match, compounded by a bad explanation of how he reached his opinion. The conviction was quashed. . . . “The impact will be quite shattering,” says Professor Norman Fenton, a mathematician at Queen Mary, University of London.
Introduction: Greg Campbell writes: I am a Canadian archaeologist (BSc in Chemistry) researching the past human use of European Atlantic shellfish. After two decades of practice I am finally getting a MA in archaeology at Reading. I am seeing if the habitat or size of harvested mussels (Mytilus edulis) can be reconstructed from measurements of the umbo (the pointy end, and the only bit that survives well in archaeological deposits) using log-transformed measurements (or allometry; relationships between dimensions are more likely exponential than linear). Of course multivariate regressions in most statistics packages (Minitab, SPSS, SAS) assume you are trying to predict one variable from all the others (a Model I regression), and use ordinary least squares to fit the regression line. For organismal dimensions this makes little sense, since all the dimensions are (at least in theory) free to change their mutual proportions during growth. So there is no predictor and predicted, mutual variation of
3 0.97955406 1134 andrew gelman stats-2012-01-21-Lessons learned from a recent R package submission
Introduction: R has zillions of packages, and people are submitting new ones each day. The volunteers who keep R going are doing an incredibly useful service to the profession, and they’re busy. A colleague sends in some suggestions based on a recent experience with a package update: 1. Always use the R dev version to write a package. Not the current stable release. The R people use the R dev version to check your package anyway. If you don’t use the R dev version, there is a chance that your package won’t pass the check. In my own experience, every time R has a major change, it tends to have new standards and find new errors in your package with these new standards. So better use the dev version to find out the potential errors in advance. 2. After submission, write an email to claim it. I used to submit the package to the CRAN without writing an email. This was standard operating procedure, but it has changed. Writing an email to claim about the submission is now a requir
same-blog 4 0.97954255 1725 andrew gelman stats-2013-02-17-“1.7%” ha ha ha
5 0.97722346 335 andrew gelman stats-2010-10-11-How to think about Lou Dobbs
Introduction: I was unsurprised to read that Lou Dobbs, the former CNN host who crusaded against illegal immigrants, had actually hired a bunch of them himself to maintain his large house and his horse farm. (OK, I have to admit I was surprised by the part about the horse farm.) But I think most of the reactions to this story missed the point. Isabel Macdonald’s article that broke the story was entitled, “Lou Dobbs, American Hypocrite,” and most of the discussion went from there, with some commenters piling on Dobbs and others defending him by saying that Dobbs hired his laborers through contractors and may not have known they were in the country illegally. To me, though, the key issue is slightly different. And Macdonald’s story is relevant whether or not Dobbs knew he was hiring illegals. My point is not that Dobbs is a bad guy, or a hypocrite, or whatever. My point is that, in his setting, it would take an extraordinary effort to not hire illegal immigrants to take care of his house
6 0.97529042 340 andrew gelman stats-2010-10-13-Randomized experiments, non-randomized experiments, and observational studies
7 0.97398084 326 andrew gelman stats-2010-10-07-Peer pressure, selection, and educational reform
8 0.97251117 1553 andrew gelman stats-2012-10-30-Real rothko, fake rothko
10 0.96865982 178 andrew gelman stats-2010-08-03-(Partisan) visualization of health care legislation
11 0.96703207 2003 andrew gelman stats-2013-08-30-Stan Project: Continuous Relaxations for Discrete MRFs
12 0.96647525 67 andrew gelman stats-2010-06-03-More on that Dartmouth health care study
14 0.96391964 1682 andrew gelman stats-2013-01-19-R package for Bayes factors
15 0.96228063 1488 andrew gelman stats-2012-09-08-Annals of spam
16 0.959782 1836 andrew gelman stats-2013-05-02-Culture clash
17 0.95651323 1440 andrew gelman stats-2012-08-02-“A Christmas Carol” as applied to plagiarism
18 0.95577472 931 andrew gelman stats-2011-09-29-Hamiltonian Monte Carlo stories
19 0.95410824 983 andrew gelman stats-2011-10-31-Skepticism about skepticism of global warming skepticism skepticism
20 0.95156091 535 andrew gelman stats-2011-01-24-Bleg: Automatic Differentiation for Log Prob Gradients?