andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-589 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: John Sides discusses how his scatterplot of unionization rates and budget deficits made it onto cable TV news: It’s also interesting to see how he [journalist Chris Hayes] chooses to explain a scatterplot — especially given the evidence that people don’t always understand scatterplots. He compares pairs of cases that don’t illustrate the basic hypothesis of Brooks, Scott Walker, et al. Obviously, such comparisons could be misleading, but given that there was no systematic relationship depicted in that graph, these particular comparisons are not. This idea (summarizing a bivariate pattern by comparing pairs of points) reminds me of a well-known statistical identity, which I refer to in a paper with David Park: John Sides is certainly correct that if you can pick your pair of points, you can make extremely misleading comparisons. But if you pick every pair of points, and average over them appropriately, you end up with the least-squares regression slope. Pretty cool, and it helps develop our intuition about the big-picture relevance of special-case comparisons.
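The identity alluded to here can be checked numerically: the least-squares slope equals the average of all pairwise slopes (y_i − y_j)/(x_i − x_j), weighted by (x_i − x_j)². A minimal sketch in Python with NumPy, using simulated data standing in for the unionization and deficit numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)  # true slope 2, plus noise

# Least-squares slope the usual way.
ols_slope = np.polyfit(x, y, 1)[0]

# Slope as a weighted average of all pairwise slopes:
# each pair (i, j) contributes (y_i - y_j)/(x_i - x_j),
# weighted by (x_i - x_j)^2; the weighted sum simplifies to
# sum(dx * dy) / sum(dx^2) over all pairs.
i, j = np.triu_indices(len(x), k=1)
dx = x[i] - x[j]
dy = y[i] - y[j]
pairwise = np.sum(dx * dy) / np.sum(dx ** 2)

assert np.isclose(ols_slope, pairwise)
```

Cherry-picking one pair gives the slope (y_i − y_j)/(x_i − x_j) of that pair alone, which can point anywhere; the weights (x_i − x_j)² show why pairs far apart in x dominate the overall slope.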
simIndex simValue blogId blogTitle
same-blog 1 1.0 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points
2 0.11010405 1989 andrew gelman stats-2013-08-20-Correcting for multiple comparisons in a Bayesian regression model
Introduction: Joe Northrup writes: I have a question about correcting for multiple comparisons in a Bayesian regression model. I believe I understand the argument in your 2012 paper in Journal of Research on Educational Effectiveness that when you have a hierarchical model there is shrinkage of estimates towards the group-level mean and thus there is no need to add any additional penalty to correct for multiple comparisons. In my case I do not have hierarchically structured data—i.e. I have only 1 observation per group but have a categorical variable with a large number of categories. Thus, I am fitting a simple multiple regression in a Bayesian framework. Would putting a strong, mean 0, multivariate normal prior on the betas in this model accomplish the same sort of shrinkage (it seems to me that it would) and do you believe this is a valid way to address criticism of multiple comparisons in this setting? My reply: Yes, I think this makes sense. One way to address concerns of multiple com
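The shrinkage in the reply can be sketched numerically. Under a mean-zero normal prior on the coefficients, the posterior mode of a linear regression is a ridge estimate, which pulls all the estimates toward zero. A minimal sketch with simulated data (the error variance `sigma2` and prior variance `tau2` are assumed known here, which a full Bayesian analysis would not require):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 10  # many categories, one observation each, coded as k predictors
X = rng.normal(size=(n, k))
y = X @ rng.normal(scale=0.5, size=k) + rng.normal(size=n)

sigma2, tau2 = 1.0, 0.25    # assumed error variance and prior variance
lam = sigma2 / tau2         # implied ridge penalty

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# Posterior mode under beta ~ N(0, tau^2 I): a ridge estimate.
beta_post = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# The prior pulls the coefficient vector toward zero.
assert np.linalg.norm(beta_post) < np.linalg.norm(beta_ols)
```

The shrinkage is what guards against the multiple-comparisons problem: extreme least-squares estimates, the ones most likely to be noise, are pulled in the hardest.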
3 0.10165124 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets
Introduction: Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure. My quick answer is that it looks really cool! From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions. Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly by a 999-degree polynomial fit to the data. So the sca
4 0.10008097 697 andrew gelman stats-2011-05-05-A statistician rereads Bill James
Introduction: Ben Lindbergh invited me to write an article for Baseball Prospectus. I first sent him this item on the differences between baseball and politics but he said it was too political for them. I then sent him this review of a book on baseball’s greatest fielders but he said they already had someone slotted to review that book. Then I sent him some reflections on the great Bill James and he published it! If anybody out there knows Bill James, please send this on to him: I have some questions at the end that I’m curious about. Here’s how it begins: I read my first Bill James book in 1984, took my first statistics class in 1985, and began graduate study in statistics the next year. Besides giving me the opportunity to study with the best applied statistician of the late 20th century (Don Rubin) and the best theoretical statistician of the early 21st (Xiao-Li Meng), going to graduate school at Harvard in 1986 gave me the opportunity to sit in a basement room one evening that
5 0.10004745 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
Introduction: Dean Eckles writes: Some of my coworkers at Facebook and I have worked with Udacity to create an online course on exploratory data analysis, including using data visualizations in R as part of EDA. The course has now launched at https://www.udacity.com/course/ud651 so anyone can take it for free. And Kaiser Fung has reviewed it. So definitely feel free to promote it! Criticism is also welcome (we are still fine-tuning things and adding more notes throughout). I wrote some more comments about the course here, including highlighting the interviews with my great coworkers. I didn’t have a chance to look at the course so instead I responded with some generic comments about eda and visualization (in no particular order): - Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what
6 0.095507383 1496 andrew gelman stats-2012-09-14-Sides and Vavreck on the 2012 election
7 0.094940349 940 andrew gelman stats-2011-10-03-It depends upon what the meaning of the word “firm” is.
9 0.089027964 144 andrew gelman stats-2010-07-13-Hey! Here’s a referee report for you!
10 0.08712019 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients
11 0.083837904 2041 andrew gelman stats-2013-09-27-Setting up Jitts online
12 0.081594035 691 andrew gelman stats-2011-05-03-Psychology researchers discuss ESP
13 0.080292918 1845 andrew gelman stats-2013-05-07-Is Felix Salmon wrong on free TV?
16 0.077169709 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies
20 0.070366934 99 andrew gelman stats-2010-06-19-Paired comparisons
simIndex simValue blogId blogTitle
same-blog 1 0.97672331 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points
Introduction: Etienne LeBel writes: You’ve probably already seen it, but I thought you could have a lot of fun with this one!! The article, with the admirably clear title given above, is by James McNulty, Michael Olson, Andrea Meltzer, Matthew Shaffer, and begins as follows: For decades, social psychological theories have posited that the automatic processes captured by implicit measures have implications for social outcomes. Yet few studies have demonstrated any long-term implications of automatic processes, and some scholars have begun to question the relevance and even the validity of these theories. At baseline of our longitudinal study, 135 newlywed couples (270 individuals) completed an explicit measure of their conscious attitudes toward their relationship and an implicit measure of their automatic attitudes toward their partner. They then reported their marital satisfaction every 6 months for the next 4 years. We found no correlation between spouses’ automatic and conscious attitu
Introduction: To understand the above title, see here . Masanao writes: This report claims that eating meat increases the risk of cancer. I’m sure you can’t read the page but you probably can understand the graphs. Different bars represent subdivision in the amount of the particular type of meat one consumes. And each chunk is different types of meat. Left is for male right is for female. They claim that the difference is significant, but they are clearly not!! I’m for not eating much meat but this is just way too much… Here’s the graph: I don’t know what to think. If you look carefully you can find one or two statistically significant differences but overall the pattern doesn’t look so compelling. I don’t know what the top and bottom rows are, though. Overall, the pattern in the top row looks like it could represent a real trend, while the graphs on the bottom row look like noise. This could be a good example for our multiple comparisons paper. If the researchers won’t
4 0.65131927 2091 andrew gelman stats-2013-11-06-“Marginally significant”
Introduction: Jeremy Fox writes: You’ve probably seen this [by Matthew Hankins]. . . . Everyone else on Twitter already has. It’s a graph of the frequency with which the phrase “marginally significant” occurs in association with different P values. Apparently it’s real data, from a Google Scholar search, though I haven’t tried to replicate the search myself. My reply: I admire the effort that went into the data collection and the excellent display (following Bill Cleveland etc., I’d prefer a landscape rather than portrait orientation of the graph, also I’d prefer a gritty histogram rather than a smooth density, and I don’t like the y-axis going below zero, nor do I like the box around the graph, also there’s that weird R default where the axis labels are so far from the actual axes, I don’t know whassup with that . . . but these are all minor, minor issues, certainly I’ve done much worse myself many times even in published articles; see the presentation here for lots of examples), an
Introduction: Greg Kaplan writes: I noticed that you have blogged a little about interstate migration trends in the US, and thought that you might be interested in a new working paper of mine (joint with Sam Schulhofer-Wohl from the Minneapolis Fed) which I have attached. Briefly, we show that much of the recent reported drop in interstate migration is a statistical artifact: The Census Bureau made an undocumented change in its imputation procedures for missing data in 2006, and this change significantly reduced the number of imputed interstate moves. The change in imputation procedures — not any actual change in migration behavior — explains 90 percent of the reported decrease in interstate migration between the 2005 and 2006 Current Population Surveys, and 42 percent of the decrease between 2000 and 2010. I haven’t had a chance to give it a serious look so I could only make the quick suggestion to make the graphs smaller and put multiple graphs on a page. This would allow the reader to bett
6 0.63826567 2076 andrew gelman stats-2013-10-24-Chasing the noise: W. Edwards Deming would be spinning in his grave
8 0.62099868 609 andrew gelman stats-2011-03-13-Coauthorship norms
9 0.61597937 486 andrew gelman stats-2010-12-26-Age and happiness: The pattern isn’t as clear as you might think
10 0.61300623 1171 andrew gelman stats-2012-02-16-“False-positive psychology”
11 0.61256409 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets
12 0.61254907 293 andrew gelman stats-2010-09-23-Lowess is great
13 0.61159039 61 andrew gelman stats-2010-05-31-A data visualization manifesto
14 0.60353178 1805 andrew gelman stats-2013-04-16-Memo to Reinhart and Rogoff: I think it’s best to admit your errors and go on from there
15 0.60311431 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect
17 0.59905946 488 andrew gelman stats-2010-12-27-Graph of the year
18 0.59733802 2247 andrew gelman stats-2014-03-14-The maximal information coefficient
19 0.59591693 991 andrew gelman stats-2011-11-04-Insecure researchers aren’t sharing their data
20 0.59476376 748 andrew gelman stats-2011-06-06-Why your Klout score is meaningless
simIndex simValue blogId blogTitle
same-blog 1 0.99995983 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points
2 0.99954706 1315 andrew gelman stats-2012-05-12-Question 2 of my final exam for Design and Analysis of Sample Surveys
Introduction: 2. Which of the following are useful goals in a pilot study? (Indicate all that apply.) (a) You can search for statistical significance, then from that decide what to look for in a confirmatory analysis of your full dataset. (b) You can see if you find statistical significance in a pre-chosen comparison of interest. (c) You can examine the direction (positive or negative, even if not statistically significant) of comparisons of interest. (d) With a small sample size, you cannot hope to learn anything conclusive, but you can get a crude estimate of effect size and standard deviation which will be useful in a power analysis to help you decide how large your full study needs to be. (e) You can talk with survey respondents and get a sense of how they perceived your questions. (f) You get a chance to learn about practical difficulties with sampling, nonresponse, and question wording. (g) You can check if your sample is approximately representative of your population. Soluti
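Point (d) above, using a pilot's crude effect-size and standard-deviation estimates in a power analysis, can be made concrete with the standard normal-approximation sample-size formula for comparing two means. This is a textbook calculation, not part of the exam question:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sample comparison of means,
    using the normal-approximation formula 2*((z_a + z_b)*sd/effect)^2."""
    z = NormalDist().inv_cdf
    za = z(1 - alpha / 2)   # two-sided test at level alpha
    zb = z(power)           # desired power
    return ceil(2 * ((za + zb) * sd / effect) ** 2)

# Pilot suggests an effect of about 0.5 sd units:
n_per_group(effect=0.5, sd=1.0)   # 63 per group
```

The point of the pilot is that `effect` and `sd` here are rough pilot-study estimates, not things you can know in advance; the formula shows how sensitively the required sample size depends on them.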
3 0.99920475 1431 andrew gelman stats-2012-07-27-Overfitting
Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a “final” model. When applying Bayesian modeling to real world data sets, how does should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different
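The cross-validation mentioned in the question can be sketched in a few lines. This is a generic k-fold check on simulated data, not anything from the original exchange: it estimates out-of-sample error, which is what in-sample fit overstates when you overfit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=60)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=60)

def cv_mse(x, y, degree, folds=5):
    """Mean squared prediction error of a polynomial fit, by k-fold CV."""
    idx = np.arange(len(x)) % folds          # assign points to folds
    errs = []
    for f in range(folds):
        train, test = idx != f, idx == f
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[test])
        errs.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errs)

# Out-of-sample error puts a modest fit and a very flexible one
# on equal footing; in-sample error cannot, since it always
# improves with degree.
mse_low = cv_mse(x, y, degree=3)
mse_high = cv_mse(x, y, degree=15)
```

As the question notes, this is an imperfect guard: if you compare many priors and model structures by their CV scores and keep the winner, the selection itself can overfit the validation folds.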
4 0.99895602 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models
Introduction: There are a few things I want to do: 1. Understand a fitted model using tools such as average predictive comparisons, R-squared, and partial pooling factors. In defining these concepts, Iain and I came up with some clever tricks, including (but not limited to): - Separating the inputs and averaging over all possible values of the input not being altered (for average predictive comparisons); - Defining partial pooling without referring to a raw-data or maximum-likelihood or no-pooling estimate (these don’t necessarily exist when you’re fitting logistic regression with sparse data); - Defining an R-squared for each level of a multilevel model. The methods get pretty complicated, though, and they have some loose ends–in particular, for average predictive comparisons with continuous input variables. So now we want to implement these in R and put them into arm along with bglmer etc. 2. Setting up coefplot so it works more generally (that is, so the graphics look nice
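An average predictive comparison can be sketched directly from its definition: change the input of interest for every observed case, hold that case's other inputs at their observed values, and average the resulting differences. The coefficients and the two inputs `u` and `v` below are hypothetical, standing in for a fitted logistic regression; a real application would average over the fitted model's posterior as well.

```python
import numpy as np

def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))

# A fitted logistic model: Pr(y = 1) = invlogit(b0 + b1*u + b2*v).
# Coefficient values are illustrative, standing in for a real fit.
b0, b1, b2 = -1.0, 0.8, 1.5

rng = np.random.default_rng(3)
u = rng.normal(size=200)   # input of interest
v = rng.normal(size=200)   # other input, held at observed values

# Average predictive comparison for u: move u from u_lo to u_hi in
# every observed case, keep that case's v, average the differences,
# and scale per unit of u.
u_lo, u_hi = -1.0, 1.0
delta = invlogit(b0 + b1 * u_hi + b2 * v) - invlogit(b0 + b1 * u_lo + b2 * v)
apc = np.mean(delta) / (u_hi - u_lo)
```

Unlike the raw coefficient `b1`, `apc` is on the probability scale, so it answers "how much does Pr(y = 1) change per unit of u, on average over the data" directly.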
5 0.99892658 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable
Introduction: Jay Ulfelder asks: I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other. Here’s what brings this up: I’m working on a project with Michael Ross where we’re modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events’ occurrence. We’ve got a TSCS data set, and we’re using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. (We’re also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model’s goodness of fit.) All of this is being done in Stata with the gllamm module. Our problem is that we have two plausib
7 0.99891663 1434 andrew gelman stats-2012-07-29-FindTheData.org
8 0.99887508 521 andrew gelman stats-2011-01-17-“the Tea Party’s ire, directed at Democrats and Republicans alike”
9 0.99851584 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings
10 0.99808592 638 andrew gelman stats-2011-03-30-More on the correlation between statistical and political ideology
11 0.9975993 756 andrew gelman stats-2011-06-10-Christakis-Fowler update
12 0.99745584 1813 andrew gelman stats-2013-04-19-Grad students: Participate in an online survey on statistics education
13 0.99738616 180 andrew gelman stats-2010-08-03-Climate Change News
15 0.99710482 740 andrew gelman stats-2011-06-01-The “cushy life” of a University of Illinois sociology professor
17 0.99664658 1585 andrew gelman stats-2012-11-20-“I know you aren’t the plagiarism police, but . . .”
18 0.99653083 1483 andrew gelman stats-2012-09-04-“Bestselling Author Caught Posting Positive Reviews of His Own Work on Amazon”
19 0.99640459 1288 andrew gelman stats-2012-04-29-Clueless Americans think they’ll never get sick
20 0.99612433 174 andrew gelman stats-2010-08-01-Literature and life