andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-687 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Nathan Roseberry writes: I thought I had read on your blog that bar charts should always include zero on the scale, but a search of your blog (or google) didn’t return what I was looking for. Is it considered a best practice to always include zero on the axis for bar charts? Has this been written in a book? My reply: The idea is that the area of the bar represents “how many” or “how much.” The bar has to go down to 0 for that to work. You don’t have to have your y-axis go to zero, but if you want the axis to go anywhere else, don’t use a bar graph, use a line graph. Usually line graphs are better anyway. I’m sure this is all in a book somewhere.
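To make the "bar has to go down to 0" point concrete, here is a minimal sketch of the arithmetic, not anything taken from the original post. It compares the ratio a reader sees between two bars with the ratio of the underlying values. The values 19 and 23 echo the projected change mentioned in a related post listed on this page; the truncated baseline of 18 and the helper names `drawn_height` and `apparent_ratio` are assumptions chosen purely for illustration.

```python
def drawn_height(value, baseline=0.0):
    """Height of a bar drawn from `baseline` up to `value`."""
    if value < baseline:
        raise ValueError("value lies below the axis baseline")
    return value - baseline


def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights (b relative to a) for a given axis baseline."""
    height_a = drawn_height(a, baseline)
    if height_a == 0:
        raise ValueError("first bar has zero drawn height")
    return drawn_height(b, baseline) / height_a


# Illustrative values only: a change from 19 to 23.
before, after = 19.0, 23.0

true_ratio = after / before                               # what the data say
honest = apparent_ratio(before, after, baseline=0.0)      # bars start at zero
truncated = apparent_ratio(before, after, baseline=18.0)  # hypothetical cut axis

print(f"ratio of values:           {true_ratio:.2f}")
print(f"bar ratio, baseline at 0:  {honest:.2f}")
print(f"bar ratio, baseline at 18: {truncated:.2f}")
```

With the baseline at zero the drawn bars are in the same proportion as the values (about 1.21 to 1). With the axis cut at 18 the second bar is drawn five times as tall as the first, which is why a bar graph is the wrong choice when the axis starts anywhere but zero. A line graph encodes values by position rather than by bar size, so this particular distortion does not arise there.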
sentIndex sentText sentNum sentScore
1 Nathan Roseberry writes: I thought I had read on your blog that bar charts should always include zero on the scale, but a search of your blog (or google) didn’t return what I was looking for. [sent-1, score-2.068]
2 Is it considered a best practice to always include zero on the axis for bar charts? [sent-2, score-1.722]
3 My reply: The idea is that the area of the bar represents “how many” or “how much.” [sent-4, score-0.947]
4 You don’t have to have your y-axis go to zero, but if you want the axis to go anywhere else, don’t use a bar graph, use a line graph. [sent-6, score-1.767]
wordName wordTfidf (topN-words)
[('bar', 0.69), ('axis', 0.276), ('charts', 0.276), ('zero', 0.259), ('line', 0.156), ('include', 0.154), ('nathan', 0.147), ('go', 0.146), ('anywhere', 0.142), ('represents', 0.114), ('book', 0.113), ('always', 0.112), ('somewhere', 0.11), ('return', 0.109), ('google', 0.098), ('search', 0.097), ('blog', 0.096), ('area', 0.094), ('scale', 0.092), ('considered', 0.088), ('practice', 0.085), ('use', 0.083), ('usually', 0.082), ('graphs', 0.08), ('written', 0.079), ('else', 0.079), ('graph', 0.075), ('looking', 0.069), ('reply', 0.063), ('best', 0.058), ('didn', 0.058), ('read', 0.056), ('thought', 0.054), ('sure', 0.051), ('idea', 0.049), ('better', 0.047), ('want', 0.045), ('many', 0.04), ('writes', 0.033)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 687 andrew gelman stats-2011-04-29-Zero is zero
2 0.31919581 1090 andrew gelman stats-2011-12-28-“. . . extending for dozens of pages”
Introduction: Kaiser writes: I have read a fair share of bore-them-to-tears compilation of survey research results – you know, those presentations with one multi-colored, stacked or grouped bar chart after another, extending for dozens of pages. I hate those grouped bar charts also—as I’ve written repeatedly, the central role of almost all statistical displays is to make comparisons, and you can make twice as many comparisons with a line plot as a bar plot. But I suspect the real problem with the reports that Kaiser is talking about is the “extending for dozens of pages” part. If they could just print each individual plot smaller and put dozens on a page, you could maybe get through the whole report in two or three pages. Almost always, graphs are too large. I’ve even seen abominations such as a fifty-page report with a single huge pie chart on each page. As Kaiser says, think about communication! A report with one big pie chart or bar plot per page is like a text document with one w
3 0.2760933 126 andrew gelman stats-2010-07-03-Graphical presentation of risk ratios
Introduction: Jimmy passes this article by Ahmad Reza Hosseinpoor and Carla AbouZahr. I have little to say, except that (a) they seem to be making a reasonable point, and (b) those bar graphs are pretty ugly.
Introduction: John Kastellec points me to this blog by Ezra Klein criticizing the following graph from a recent Republican Party report: Klein (following Alexander Hart) slams the graph for not going all the way to zero on the y-axis, thus making the projected change seem bigger than it really is. I agree with Klein and Hart that, if you’re gonna do a bar chart, you want the bars to go down to 0. On the other hand, a projected change from 19% to 23% is actually pretty big, and I don’t see the point of using a graphical display that hides it. The solution: Ditch the bar graph entirely and replace it by a lineplot, in particular, a time series with year-by-year data. The time series would have several advantages: 1. Data are placed in context. You’d see every year, instead of discrete averages, and you’d get to see the changes in the context of year-to-year variation. 2. With the time series, you can use whatever y-axis works with the data. No need to go to zero. P.S. I l
5 0.18632703 1061 andrew gelman stats-2011-12-16-CrossValidated: A place to post your statistics questions
Introduction: Seth Rogers writes: I [Rogers] am a member of an online community of statisticians where I burn a great deal of time (and a recovering cog sci researcher). Our community website is a peer-reviewed Q and A spanning stats topics ranging from applications to mathematical theory. Our online community consists of mostly university faculty, grad students and technical consultants. The answer quality is very strong and the web design is intuitive. I think you and your readers are like-minded and would be really interested in some of the topics on the site, CrossValidated (you may know the sister site: stackoverflow.com). The philosophy is purely to further knowledge for the sake of knowledge and take pride in learning. I took a quick look and the site seemed like it could be useful to people. The only thing I didn’t understand is, why doesn’t it have a search function? (Or maybe it was there somewhere and I couldn’t find it.) P.S. to all the commenters who wrote replies such
6 0.16043074 37 andrew gelman stats-2010-05-17-Is chartjunk really “more useful” than plain graphs? I don’t think so.
7 0.15446036 1258 andrew gelman stats-2012-04-10-Why display 6 years instead of 30?
8 0.15134293 1498 andrew gelman stats-2012-09-16-Choices in graphing parallel time series
9 0.14802384 1800 andrew gelman stats-2013-04-12-Too tired to mock
10 0.10954157 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology
11 0.10456818 61 andrew gelman stats-2010-05-31-A data visualization manifesto
12 0.10215199 1176 andrew gelman stats-2012-02-19-Standardized writing styles and standardized graphing styles
13 0.10076013 672 andrew gelman stats-2011-04-20-The R code for those time-use graphs
14 0.10063709 446 andrew gelman stats-2010-12-03-Is 0.05 too strict as a p-value threshold?
15 0.098387748 991 andrew gelman stats-2011-11-04-Insecure researchers aren’t sharing their data
16 0.093435585 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
17 0.092676416 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go
18 0.091606848 798 andrew gelman stats-2011-07-12-Sometimes a graph really is just ugly
20 0.089676574 1180 andrew gelman stats-2012-02-22-I’m officially no longer a “rogue”
topicId topicWeight
[(0, 0.11), (1, -0.022), (2, -0.027), (3, 0.068), (4, 0.099), (5, -0.111), (6, -0.007), (7, 0.025), (8, 0.003), (9, 0.015), (10, 0.03), (11, -0.034), (12, 0.016), (13, -0.011), (14, 0.044), (15, -0.0), (16, -0.022), (17, 0.019), (18, 0.027), (19, -0.015), (20, 0.028), (21, 0.027), (22, -0.019), (23, 0.013), (24, 0.041), (25, -0.024), (26, 0.065), (27, -0.008), (28, -0.029), (29, 0.003), (30, -0.022), (31, -0.015), (32, -0.043), (33, 0.003), (34, -0.089), (35, -0.014), (36, 0.042), (37, -0.059), (38, 0.003), (39, -0.045), (40, -0.002), (41, 0.011), (42, 0.015), (43, 0.126), (44, -0.055), (45, -0.035), (46, 0.002), (47, 0.082), (48, 0.003), (49, -0.017)]
simIndex simValue blogId blogTitle
same-blog 1 0.96647853 687 andrew gelman stats-2011-04-29-Zero is zero
2 0.79415762 1439 andrew gelman stats-2012-08-01-A book with a bunch of simple graphs
Introduction: Howard Friedman sent me a new book, The Measure of a Nation, subtitled How to Regain America’s Competitive Edge and Boost Our Global Standing. Without commenting on the substance of Friedman’s recommendations, I’d like to endorse his strategy of presentation, which is to display graph after graph after graph showing the same message over and over again, which is that the U.S. is outperformed by various other countries (mostly in Europe) on a variety of measures. These aren’t graphs I would ever make—they are scatterplots in which the x-axis conveys no information. But they have the advantage of repetition: once you figure out how to read one of the graphs, you can read the others easily. Here’s an example which I found from a quick Google: I can’t actually figure out what is happening on the x-axis, nor do I understand the “star, middle child, dog” thing. But I like the use of graphics. Lots more fun than bullet points. Seriously. P.S. Just to be clear: I am not trying
3 0.76127362 126 andrew gelman stats-2010-07-03-Graphical presentation of risk ratios
Introduction: Jimmy passes this article by Ahmad Reza Hosseinpoor and Carla AbouZahr. I have little to say, except that (a) they seem to be making a reasonable point, and (b) those bar graphs are pretty ugly.
4 0.75371152 672 andrew gelman stats-2011-04-20-The R code for those time-use graphs
Introduction: By popular demand, here’s my R script for the time-use graphs:

# The data
a1 <- c(4.2,3.2,11.1,1.3,2.2,2.0)
a2 <- c(3.9,3.2,10.0,0.8,3.1,3.1)
a3 <- c(6.3,2.5,9.8,0.9,2.2,2.4)
a4 <- c(4.4,3.1,9.8,0.8,3.3,2.7)
a5 <- c(4.8,3.0,9.9,0.7,3.3,2.4)
a6 <- c(4.0,3.4,10.5,0.7,3.3,2.1)
a <- rbind(a1,a2,a3,a4,a5,a6)
avg <- colMeans (a)
avg.array <- t (array (avg, rev(dim(a))))
diff <- a - avg.array
country.name <- c("France", "Germany", "Japan", "Britain", "USA", "Turkey")

# The line plots
par (mfrow=c(2,3), mar=c(4,4,2,.5), mgp=c(2,.7,0), tck=-.02, oma=c(3,0,4,0), bg="gray96", fg="gray30")
for (i in 1:6){
  plot (c(1,6), c(-1,1.7), xlab="", ylab="", xaxt="n", yaxt="n", bty="l", type="n")
  lines (1:6, diff[i,], col="blue")
  points (1:6, diff[i,], pch=19, col="black")
  if (i>3){
    axis (1, c(1,3,5), c ("Work,\nstudy", "Eat,\nsleep", "Leisure"), mgp=c(2,1.5,0), tck=0, cex.axis=1.2)
    axis (1, c(2,4,6), c ("Unpaid\nwork", "Personal\nCare", "Other"), mgp=c(2,1.5,0),
Introduction: John Kastellec points me to this blog by Ezra Klein criticizing the following graph from a recent Republican Party report: Klein (following Alexander Hart) slams the graph for not going all the way to zero on the y-axis, thus making the projected change seem bigger than it really is. I agree with Klein and Hart that, if you’re gonna do a bar chart, you want the bars to go down to 0. On the other hand, a projected change from 19% to 23% is actually pretty big, and I don’t see the point of using a graphical display that hides it. The solution: Ditch the bar graph entirely and replace it by a lineplot, in particular, a time series with year-by-year data. The time series would have several advantages: 1. Data are placed in context. You’d see every year, instead of discrete averages, and you’d get to see the changes in the context of year-to-year variation. 2. With the time series, you can use whatever y-axis works with the data. No need to go to zero. P.S. I l
6 0.71437442 1104 andrew gelman stats-2012-01-07-A compelling reason to go to London, Ontario??
7 0.70909065 1011 andrew gelman stats-2011-11-15-World record running times vs. distance
8 0.70014644 1090 andrew gelman stats-2011-12-28-“. . . extending for dozens of pages”
9 0.69945127 1258 andrew gelman stats-2012-04-10-Why display 6 years instead of 30?
10 0.68403971 1606 andrew gelman stats-2012-12-05-The Grinch Comes Back
12 0.67161679 37 andrew gelman stats-2010-05-17-Is chartjunk really “more useful” than plain graphs? I don’t think so.
13 0.67115635 296 andrew gelman stats-2010-09-26-A simple semigraphic display
15 0.66251159 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly
16 0.66087145 671 andrew gelman stats-2011-04-20-One more time-use graph
17 0.65904647 1800 andrew gelman stats-2013-04-12-Too tired to mock
18 0.65820903 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year
20 0.64893526 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology
topicId topicWeight
[(5, 0.051), (15, 0.022), (16, 0.072), (24, 0.219), (41, 0.018), (53, 0.126), (77, 0.019), (99, 0.307)]
simIndex simValue blogId blogTitle
same-blog 1 0.98719066 687 andrew gelman stats-2011-04-29-Zero is zero
2 0.98016846 1047 andrew gelman stats-2011-12-08-I Am Too Absolutely Heteroskedastic for This Probit Model
Introduction: Soren Lorensen wrote: I’m working on a project that uses a binary choice model on panel data. Since I have panel data and am using MLE, I’m concerned about heteroskedasticity making my estimates inconsistent and biased. Are you familiar with any statistical packages with pre-built tests for heteroskedasticity in binary choice ML models? If not, is there value in cutting my data into groups over which I guess the error variance might vary and eyeballing residual plots? Have you other suggestions about how I might resolve this concern? I replied that I wouldn’t worry so much about heteroskedasticity. Breaking up the data into pieces might make sense, but for the purpose of estimating how the coefficients might vary—that is, nonlinearity and interactions. Soren shot back: I’m somewhat puzzled however: homoskedasticity is an identifying assumption in estimating a probit model: if we don’t have it all sorts of bad things can happen to our parameter estimates. Do you suggest n
3 0.97929245 547 andrew gelman stats-2011-01-31-Using sample size in the prior distribution
Introduction: Mike McLaughlin writes: Consider the Seeds example in vol. 1 of the BUGS examples. There, a binomial likelihood has a p parameter constructed, via logit, from two covariates. What I am wondering is: Would it be legitimate, in a binomial + logit problem like this, to allow binomial p[i] to be a function of the corresponding n[i] or would that amount to using the data in the prior? In other words, in the context of the Seeds example, is r[] the only data or is n[] data as well and therefore not permissible in a prior formulation? I [McLaughlin] currently have a model with a common beta prior for all p[i] but would like to mitigate this commonality (a kind of James-Stein effect) when there are lots of observations for some i. But this seems to feed the data back into the prior. Does it really? It also occurs to me [McLaughlin] that, perhaps, a binomial likelihood is not the one to use here (not flexible enough). My reply: Strictly speaking, “n” is data, and so what you wa
4 0.97253573 446 andrew gelman stats-2010-12-03-Is 0.05 too strict as a p-value threshold?
Introduction: Seth sent along an article (not by him) from the psychology literature and wrote: This is a good example of your complaint about statistical significance. The authors want to say that predictability of information determines how distracting something is and have two conditions that vary in predictability. One is significantly distracting, the other isn’t. But the two conditions are not significantly different from each other. So the two conditions are different more weakly than p = 0.05. I don’t think the reviewers failed to notice this. They just thought it should be published anyway, is my guess. To me, the interesting question is: where should the bar be? at p = 0.05? at p = 0.10? something else? How can we figure out where to put the bar? I replied: My quick answer is that we have to get away from .05 and .10 and move to something that takes into account prior information. This could be Bayesian (of course) or could be done classically using power calculations, as disc
Introduction: I’m sorry I don’t have any new zombie papers in time for Halloween. Instead I’d like to be a little monster by reproducing a mini-rant from this article on experimental reasoning in social science: I will restrict my discussion to social science examples. Social scientists are often tempted to illustrate their ideas with examples from medical research. When it comes to medicine, though, we are, with rare exceptions, at best ignorant laypersons (in my case, not even reaching that level), and it is my impression that by reaching for medical analogies we are implicitly trying to borrow some of the scientific and cultural authority of that field for our own purposes. Evidence-based medicine is the subject of a large literature of its own (see, for example, Lau, Ioannidis, and Schmid, 1998).
6 0.96079963 1905 andrew gelman stats-2013-06-18-There are no fat sprinters
7 0.95963144 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards
8 0.95814848 495 andrew gelman stats-2010-12-31-“Threshold earners” and economic inequality
9 0.95770824 991 andrew gelman stats-2011-11-04-Insecure researchers aren’t sharing their data
11 0.95265406 2155 andrew gelman stats-2013-12-31-No on Yes-No decisions
12 0.9521172 1858 andrew gelman stats-2013-05-15-Reputations changeable, situations tolerable
13 0.95173323 248 andrew gelman stats-2010-09-01-Ratios where the numerator and denominator both change signs
14 0.9497627 488 andrew gelman stats-2010-12-27-Graph of the year
15 0.94698381 1856 andrew gelman stats-2013-05-14-GPstuff: Bayesian Modeling with Gaussian Processes
16 0.94510806 2313 andrew gelman stats-2014-04-30-Seth Roberts
17 0.94307059 466 andrew gelman stats-2010-12-13-“The truth wears off: Is there something wrong with the scientific method?”
18 0.94258547 899 andrew gelman stats-2011-09-10-The statistical significance filter
19 0.94227993 1155 andrew gelman stats-2012-02-05-What is a prior distribution?
20 0.94205278 1956 andrew gelman stats-2013-07-25-What should be in a machine learning course?