andrew_gelman_stats-2011-589 knowledge-graph by maker-knowledge-mining

589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points


meta info for this blog

Source: html

Introduction: John Sides discusses how his scatterplot of unionization rates and budget deficits made it onto cable TV news: It’s also interesting to see how he [journalist Chris Hayes] chooses to explain a scatterplot — especially given the evidence that people don’t always understand scatterplots. He compares pairs of cases that don’t illustrate the basic hypothesis of Brooks, Scott Walker, et al. Obviously, such comparisons could be misleading, but given that there was no systematic relationship depicted in that graph, these particular comparisons are not. This idea–summarizing a bivariate pattern by comparing pairs of points–reminds me of a well-known statistical identity which I refer to in a paper with David Park: John Sides is certainly correct that if you can pick your pair of points, you can make extremely misleading comparisons. But if you pick every pair of points, and average over them appropriately, you end up with the least-squares regression slope. Pretty cool, and it helps develop our intuition about the big-picture relevance of special-case comparisons.
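The identity behind that last claim is easy to verify numerically: the least-squares slope equals a weighted average of all pairwise slopes (y_i - y_j)/(x_i - x_j), with each pair weighted in proportion to (x_i - x_j)^2. The short Python sketch below checks this on simulated data; the sample size, the simulated relationship, and the variable names are illustrative assumptions, not anything from John Sides's actual dataset.

    # Numerical check: the OLS slope equals the weighted average of all
    # pairwise slopes, with weights proportional to (x_i - x_j)^2.
    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    n = 50
    x = rng.normal(size=n)              # hypothetical predictor
    y = 0.5 * x + rng.normal(size=n)    # hypothetical outcome with noise

    # Ordinary least-squares slope: cov(x, y) / var(x).
    ols_slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)

    # Average over every pair of points, weighting each pairwise slope
    # (y_i - y_j)/(x_i - x_j) by (x_i - x_j)^2; the weights cancel the
    # slope denominators, so two running sums are all that is needed.
    num = den = 0.0
    for i, j in combinations(range(n), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        num += dx * dy
        den += dx * dx
    pairwise_slope = num / den

    print(ols_slope, pairwise_slope)    # the two estimates agree

Cherry-picking a single pair amounts to reporting just one of those pairwise slopes with all its noise; averaging over every pair, with the appropriate weights, recovers the regression line.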


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 John Sides discusses how his scatterplot of unionization rates and budget deficits made it onto cable TV news: It’s also interesting to see how he [journalist Chris Hayes] chooses to explain a scatterplot — especially given the evidence that people don’t always understand scatterplots. [sent-1, score-1.568]

2 He compares pairs of cases that don’t illustrate the basic hypothesis of Brooks, Scott Walker, et al. [sent-2, score-0.693]

3 Obviously, such comparisons could be misleading, but given that there was no systematic relationship depicted in that graph, these particular comparisons are not. [sent-3, score-0.940]

4 This idea–summarizing a bivariate pattern by comparing pairs of points–reminds me of a well-known statistical identity which I refer to in a paper with David Park: John Sides is certainly correct that if you can pick your pair of points, you can make extremely misleading comparisons. [sent-4, score-1.603]

5 But if you pick every pair of points, and average over them appropriately, you end up with the least-squares regression slope. [sent-5, score-0.407]

6 Pretty cool, and it helps develop our intuition about the big-picture relevance of special-case comparisons. [sent-6, score-0.415]


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('scatterplot', 0.263), ('pairs', 0.243), ('pair', 0.236), ('comparisons', 0.233), ('sides', 0.216), ('misleading', 0.196), ('depicted', 0.184), ('identities', 0.184), ('pick', 0.171), ('hayes', 0.166), ('cable', 0.16), ('deficits', 0.155), ('walker', 0.151), ('appropriately', 0.151), ('chooses', 0.145), ('points', 0.144), ('bivariate', 0.137), ('john', 0.13), ('onto', 0.121), ('compares', 0.12), ('park', 0.12), ('summarizing', 0.12), ('budget', 0.116), ('tv', 0.115), ('brooks', 0.114), ('scott', 0.109), ('helps', 0.108), ('journalist', 0.107), ('intuition', 0.106), ('refer', 0.105), ('illustrate', 0.104), ('relevance', 0.103), ('systematic', 0.101), ('relationship', 0.1), ('discusses', 0.099), ('develop', 0.098), ('obviously', 0.097), ('chris', 0.091), ('reminds', 0.09), ('extremely', 0.089), ('given', 0.089), ('cool', 0.088), ('comparing', 0.087), ('rates', 0.083), ('pattern', 0.083), ('et', 0.078), ('hypothesis', 0.074), ('basic', 0.074), ('explain', 0.074), ('correct', 0.072)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points


2 0.11010405 1989 andrew gelman stats-2013-08-20-Correcting for multiple comparisons in a Bayesian regression model

Introduction: Joe Northrup writes: I have a question about correcting for multiple comparisons in a Bayesian regression model. I believe I understand the argument in your 2012 paper in Journal of Research on Educational Effectiveness that when you have a hierarchical model there is shrinkage of estimates towards the group-level mean and thus there is no need to add any additional penalty to correct for multiple comparisons. In my case I do not have hierarchically structured data—i.e. I have only 1 observation per group but have a categorical variable with a large number of categories. Thus, I am fitting a simple multiple regression in a Bayesian framework. Would putting a strong, mean 0, multivariate normal prior on the betas in this model accomplish the same sort of shrinkage (it seems to me that it would) and do you believe this is a valid way to address criticism of multiple comparisons in this setting? My reply: Yes, I think this makes sense. One way to address concerns of multiple com

3 0.10165124 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

Introduction: Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure. My quick answer is that it looks really cool! From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions. Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly by a 999-degree polynomial fit to the data. So the sca

4 0.10008097 697 andrew gelman stats-2011-05-05-A statistician rereads Bill James

Introduction: Ben Lindbergh invited me to write an article for Baseball Prospectus. I first sent him this item on the differences between baseball and politics but he said it was too political for them. I then sent him this review of a book on baseball’s greatest fielders but he said they already had someone slotted to review that book. Then I sent him some reflections on the great Bill James and he published it! If anybody out there knows Bill James, please send this on to him: I have some questions at the end that I’m curious about. Here’s how it begins: I read my first Bill James book in 1984, took my first statistics class in 1985, and began graduate study in statistics the next year. Besides giving me the opportunity to study with the best applied statistician of the late 20th century (Don Rubin) and the best theoretical statistician of the early 21st (Xiao-Li Meng), going to graduate school at Harvard in 1986 gave me the opportunity to sit in a basement room one evening that

5 0.10004745 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice

Introduction: Dean Eckles writes: Some of my coworkers at Facebook and I have worked with Udacity to create an online course on exploratory data analysis, including using data visualizations in R as part of EDA. The course has now launched at https://www.udacity.com/course/ud651 so anyone can take it for free. And Kaiser Fung has reviewed it. So definitely feel free to promote it! Criticism is also welcome (we are still fine-tuning things and adding more notes throughout). I wrote some more comments about the course here, including highlighting the interviews with my great coworkers. I didn’t have a chance to look at the course so instead I responded with some generic comments about eda and visualization (in no particular order): - Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what

6 0.095507383 1496 andrew gelman stats-2012-09-14-Sides and Vavreck on the 2012 election

7 0.094940349 940 andrew gelman stats-2011-10-03-It depends upon what the meaning of the word “firm” is.

8 0.090300545 1729 andrew gelman stats-2013-02-20-My beef with Brooks: the alternative to “good statistics” is not “no statistics,” it’s “bad statistics”

9 0.089027964 144 andrew gelman stats-2010-07-13-Hey! Here’s a referee report for you!

10 0.08712019 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

11 0.083837904 2041 andrew gelman stats-2013-09-27-Setting up Jitts online

12 0.081594035 691 andrew gelman stats-2011-05-03-Psychology researchers discuss ESP

13 0.080292918 1845 andrew gelman stats-2013-05-07-Is Felix Salmon wrong on free TV?

14 0.079266936 1834 andrew gelman stats-2013-05-01-A graph at war with its caption. Also, how to visualize the same numbers without giving the display a misleading causal feel?

15 0.078054897 1458 andrew gelman stats-2012-08-14-1.5 million people were told that extreme conservatives are happier than political moderates. Approximately .0001 million Americans learned that the opposite is true.

16 0.077169709 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies

17 0.076203272 2042 andrew gelman stats-2013-09-28-Difficulties of using statistical significance (or lack thereof) to sift through and compare research hypotheses

18 0.07364326 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go

19 0.071049191 274 andrew gelman stats-2010-09-14-Battle of the Americans: Writer at the American Enterprise Institute disparages the American Political Science Association

20 0.070366934 99 andrew gelman stats-2010-06-19-Paired comparisons


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.115), (1, -0.023), (2, 0.024), (3, -0.012), (4, 0.032), (5, -0.067), (6, -0.037), (7, 0.029), (8, -0.006), (9, -0.003), (10, -0.023), (11, 0.021), (12, -0.019), (13, -0.012), (14, 0.012), (15, 0.032), (16, -0.013), (17, -0.006), (18, -0.015), (19, -0.047), (20, -0.015), (21, 0.024), (22, 0.009), (23, 0.003), (24, 0.004), (25, -0.003), (26, 0.026), (27, 0.01), (28, 0.021), (29, -0.015), (30, 0.035), (31, 0.057), (32, 0.04), (33, -0.04), (34, -0.014), (35, 0.002), (36, 0.024), (37, -0.042), (38, -0.018), (39, -0.008), (40, -0.01), (41, 0.059), (42, 0.019), (43, -0.032), (44, -0.003), (45, -0.011), (46, 0.024), (47, -0.015), (48, 0.008), (49, -0.019)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97672331 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points


2 0.66082549 2156 andrew gelman stats-2014-01-01-“Though They May Be Unaware, Newlyweds Implicitly Know Whether Their Marriage Will Be Satisfying”

Introduction: Etienne LeBel writes: You’ve probably already seen it, but I thought you could have a lot of fun with this one!! The article , with the admirably clear title given above, is by James McNulty, Michael Olson, Andrea Meltzer, Matthew Shaffer, and begins as follows: For decades, social psychological theories have posited that the automatic processes captured by implicit measures have implications for social outcomes. Yet few studies have demonstrated any long-term implications of automatic processes, and some scholars have begun to question the relevance and even the validity of these theories. At baseline of our longitudinal study, 135 newlywed couples (270 individuals) completed an explicit measure of their conscious attitudes toward their relationship and an implicit measure of their automatic attitudes toward their partner. They then reported their marital satisfaction every 6 months for the next 4 years. We found no correlation between spouses’ automatic and conscious attitu

3 0.65348363 1059 andrew gelman stats-2011-12-14-Looking at many comparisons may increase the risk of finding something statistically significant by epidemiologists, a population with relatively low multilevel modeling consumption

Introduction: To understand the above title, see here . Masanao writes: This report claims that eating meat increases the risk of cancer. I’m sure you can’t read the page but you probably can understand the graphs. Different bars represent subdivision in the amount of the particular type of meat one consumes. And each chunk is different types of meat. Left is for male right is for female. They claim that the difference is significant, but they are clearly not!! I’m for not eating much meat but this is just way too much… Here’s the graph: I don’t know what to think. If you look carefully you can find one or two statistically significant differences but overall the pattern doesn’t look so compelling. I don’t know what the top and bottom rows are, though. Overall, the pattern in the top row looks like it could represent a real trend, while the graphs on the bottom row look like noise. This could be a good example for our multiple comparisons paper. If the researchers won’t

4 0.65131927 2091 andrew gelman stats-2013-11-06-“Marginally significant”

Introduction: Jeremy Fox writes: You’ve probably seen this [by Matthew Hankins]. . . . Everyone else on Twitter already has. It’s a graph of the frequency with which the phrase “marginally significant” occurs in association with different P values. Apparently it’s real data, from a Google Scholar search, though I haven’t tried to replicate the search myself. My reply: I admire the effort that went into the data collection and the excellent display (following Bill Cleveland etc., I’d prefer a landscape rather than portrait orientation of the graph, also I’d prefer a gritty histogram rather than a smooth density, and I don’t like the y-axis going below zero, nor do I like the box around the graph, also there’s that weird R default where the axis labels are so far from the actual axes, I don’t know whassup with that . . . but these are all minor, minor issues, certainly I’ve done much worse myself many times even in published articles; see the presentation here for lots of examples), an

5 0.63886261 404 andrew gelman stats-2010-11-09-“Much of the recent reported drop in interstate migration is a statistical artifact”

Introduction: Greg Kaplan writes: I noticed that you have blogged a little about interstate migration trends in the US, and thought that you might be interested in a new working paper of mine (joint with Sam Schulhofer-Wohl from the Minneapolis Fed) which I have attached. Briefly, we show that much of the recent reported drop in interstate migration is a statistical artifact: The Census Bureau made an undocumented change in its imputation procedures for missing data in 2006, and this change significantly reduced the number of imputed interstate moves. The change in imputation procedures — not any actual change in migration behavior — explains 90 percent of the reported decrease in interstate migration between the 2005 and 2006 Current Population Surveys, and 42 percent of the decrease between 2000 and 2010. I haven’t had a chance to give a serious look so could only make the quick suggestion to make the graphs smaller and put multiple graphs on a page. This would allow the reader to bett

6 0.63826567 2076 andrew gelman stats-2013-10-24-Chasing the noise: W. Edwards Deming would be spinning in his grave

7 0.63333392 1706 andrew gelman stats-2013-02-04-Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

8 0.62099868 609 andrew gelman stats-2011-03-13-Coauthorship norms

9 0.61597937 486 andrew gelman stats-2010-12-26-Age and happiness: The pattern isn’t as clear as you might think

10 0.61300623 1171 andrew gelman stats-2012-02-16-“False-positive psychology”

11 0.61256409 1062 andrew gelman stats-2011-12-16-Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

12 0.61254907 293 andrew gelman stats-2010-09-23-Lowess is great

13 0.61159039 61 andrew gelman stats-2010-05-31-A data visualization manifesto

14 0.60353178 1805 andrew gelman stats-2013-04-16-Memo to Reinhart and Rogoff: I think it’s best to admit your errors and go on from there

15 0.60311431 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect

16 0.60030967 1609 andrew gelman stats-2012-12-06-Stephen Kosslyn’s principles of graphics and one more: There’s no need to cram everything into a single plot

17 0.59905946 488 andrew gelman stats-2010-12-27-Graph of the year

18 0.59733802 2247 andrew gelman stats-2014-03-14-The maximal information coefficient

19 0.59591693 991 andrew gelman stats-2011-11-04-Insecure researchers aren’t sharing their data

20 0.59476376 748 andrew gelman stats-2011-06-06-Why your Klout score is meaningless


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(16, 0.027), (24, 0.089), (99, 0.772)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99995983 589 andrew gelman stats-2011-02-24-On summarizing a noisy scatterplot with a single comparison of two points


2 0.99954706 1315 andrew gelman stats-2012-05-12-Question 2 of my final exam for Design and Analysis of Sample Surveys

Introduction: 2. Which of the following are useful goals in a pilot study? (Indicate all that apply.) (a) You can search for statistical significance, then from that decide what to look for in a confirmatory analysis of your full dataset. (b) You can see if you find statistical significance in a pre-chosen comparison of interest. (c) You can examine the direction (positive or negative, even if not statistically significant) of comparisons of interest. (d) With a small sample size, you cannot hope to learn anything conclusive, but you can get a crude estimate of effect size and standard deviation which will be useful in a power analysis to help you decide how large your full study needs to be. (e) You can talk with survey respondents and get a sense of how they perceived your questions. (f) You get a chance to learn about practical difficulties with sampling, nonresponse, and question wording. (g) You can check if your sample is approximately representative of your population. Soluti

3 0.99920475 1431 andrew gelman stats-2012-07-27-Overfitting

Introduction: Ilya Esteban writes: In traditional machine learning and statistical learning techniques, you spend a lot of time selecting your input features, fiddling with model parameter values, etc., all of which leads to the problem of overfitting the data and producing overly optimistic estimates for how good the model really is. You can use techniques such as cross-validation and out-of-sample validation data to try to limit the damage, but they are imperfect solutions at best. While Bayesian models have the great advantage of not forcing you to manually select among the various weights and input features, you still often end up trying different priors and model structures (especially with hierarchical models), before coming up with a "final" model. When applying Bayesian modeling to real world data sets, how should you evaluate alternate priors and topologies for the model without falling into the same overfitting trap as you do with non-Bayesian models? If you try several different

4 0.99895602 772 andrew gelman stats-2011-06-17-Graphical tools for understanding multilevel models

Introduction: There are a few things I want to do: 1. Understand a fitted model using tools such as average predictive comparisons , R-squared, and partial pooling factors . In defining these concepts, Iain and I came up with some clever tricks, including (but not limited to): - Separating the inputs and averaging over all possible values of the input not being altered (for average predictive comparisons); - Defining partial pooling without referring to a raw-data or maximum-likelihood or no-pooling estimate (these don’t necessarily exist when you’re fitting logistic regression with sparse data); - Defining an R-squared for each level of a multilevel model. The methods get pretty complicated, though, and they have some loose ends–in particular, for average predictive comparisons with continuous input variables. So now we want to implement these in R and put them into arm along with bglmer etc. 2. Setting up coefplot so it works more generally (that is, so the graphics look nice

5 0.99892658 726 andrew gelman stats-2011-05-22-Handling multiple versions of an outcome variable

Introduction: Jay Ulfelder asks: I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other. Here’s what brings this up: I’m working on a project with Michael Ross where we’re modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events’ occurrence. We’ve got a TSCS data set, and we’re using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. (We’re also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model’s goodness of fit.) All of this is being done in Stata with the gllamm module. Our problem is that we have two plausib

6 0.99891698 809 andrew gelman stats-2011-07-19-“One of the easiest ways to differentiate an economist from almost anyone else in society”

7 0.99891663 1434 andrew gelman stats-2012-07-29-FindTheData.org

8 0.99887508 521 andrew gelman stats-2011-01-17-“the Tea Party’s ire, directed at Democrats and Republicans alike”

9 0.99851584 1425 andrew gelman stats-2012-07-23-Examples of the use of hierarchical modeling to generalize to new settings

10 0.99808592 638 andrew gelman stats-2011-03-30-More on the correlation between statistical and political ideology

11 0.9975993 756 andrew gelman stats-2011-06-10-Christakis-Fowler update

12 0.99745584 1813 andrew gelman stats-2013-04-19-Grad students: Participate in an online survey on statistics education

13 0.99738616 180 andrew gelman stats-2010-08-03-Climate Change News

14 0.99736381 1952 andrew gelman stats-2013-07-23-Christakis response to my comment on his comments on social science (or just skip to the P.P.P.S. at the end)

15 0.99710482 740 andrew gelman stats-2011-06-01-The “cushy life” of a University of Illinois sociology professor

16 0.99679971 507 andrew gelman stats-2011-01-07-Small world: MIT, asymptotic behavior of differential-difference equations, Susan Assmann, subgroup analysis, multilevel modeling

17 0.99664658 1585 andrew gelman stats-2012-11-20-“I know you aren’t the plagiarism police, but . . .”

18 0.99653083 1483 andrew gelman stats-2012-09-04-“Bestselling Author Caught Posting Positive Reviews of His Own Work on Amazon”

19 0.99640459 1288 andrew gelman stats-2012-04-29-Clueless Americans think they’ll never get sick

20 0.99612433 174 andrew gelman stats-2010-08-01-Literature and life