2132 andrew gelman stats-2013-12-13-And now, here’s something that would make Ed Tufte spin in his . . . ummm, Tufte’s still around, actually, so let’s just say I don’t think he’d like it!


meta infos for this blog

Source: html

Introduction: We haven’t had one of these in awhile, having mostly switched to the “chess trivia” and “bad p-values” genres of blogging . . . But I had to come back to the topic after receiving this note from Raghuveer Parthasarathy: Here’s another bad graph you might like. It might (arguably) be even worse than the “worst graphs of the year” you’ve blogged about, since rather than being a poor representation of data, it is simply the plotting of a tautology that mistakenly gives the impression of being data. (And it’s in Nature.) Parthasarathy explains: On the vertical axis we have the probability of being Type 2 Diabetic (T2D). On the horizontal axis we have the probability of being normal. There’s a clear, important trend evident, right? No! The probability of being normal is trivially one minus the probability of being T2D! The graph could not possibly be anything other than a straight line of slope -1. (For the students out there: the complete lack of scatter in the graph is


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We haven’t had one of these in awhile, having mostly switched to the “chess trivia” and “bad p-values” genres of blogging . [sent-1, score-0.226]

2 But I had to come back to the topic after receiving this note from Raghuveer Parthasarathy: Here’s another bad graph you might like. [sent-4, score-0.426]

3 It might (arguably) be even worse than the “worst graphs of the year” you’ve blogged about, since rather than being a poor representation of data, it is simply the plotting of a tautology that mistakenly gives the impression of being data. [sent-5, score-0.806]

4 Parthasarathy explains: On the vertical axis we have the probability of being Type 2 Diabetic (T2D). [sent-7, score-0.553]

5 On the horizontal axis we have the probability of being normal. [sent-8, score-0.555]

6 The probability of being normal is trivially one minus the probability of being T2D! [sent-11, score-0.879]

7 The graph could not possibly be anything other than a straight line of slope -1. [sent-12, score-0.43]

8 (For the students out there: the complete lack of scatter in the graph is a strong hint of something wrong.) [sent-13, score-0.498]

9 They assign the data points for people with a > 50% probability of being T2D to be red, and the opposite to be green. [sent-15, score-0.368]

10 The graph is simply plotting a tautology, that the probability of x is one minus the probability of not-x, together with a color scheme for labeling x. [sent-16, score-1.546]

11 Paraphrasing Tufte, it has an information-to-ink ratio of approximately zero. [sent-17, score-0.149]

12 Not quite zero: what we seem to have here is a highly inefficient two-dimensional multicolor display of a one-dimensional set of 49 numbers, using dots that are so blurry that we can’t actually get much of a sense of their distribution. [sent-18, score-0.3]

13 All joking aside, I’m guessing this graph would be much better if the x-axis were used for some relevant continuous variable (for example, people’s ages) and the colors used for some discrete variable (for example, some other indicator of health status). [sent-19, score-0.921]

14 Parthasarathy does add: “I’ll stress that the study itself is fascinating. [sent-20, score-0.102]

15 ” So, just to be clear, he’s criticizing the graph, not the underlying research. [sent-21, score-0.073]
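
Parthasarathy's point, that plotting the probability of T2D against the probability of being normal can only give a line of slope -1 with no scatter, can be checked numerically. The sketch below is only an illustration: the 49 probabilities are made-up random values (the real numbers from the Nature figure are not reproduced here), and `fit_line` is a hypothetical helper written for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumption: 49 invented probabilities standing in for the plotted points.
rng = random.Random(0)
p_t2d = [rng.random() for _ in range(49)]

# The horizontal axis is just the complement of the vertical axis.
p_normal = [1.0 - p for p in p_t2d]

slope, intercept = fit_line(p_normal, p_t2d)

# Largest vertical distance between any point and the fitted line.
max_residual = max(
    abs(y - (slope * x + intercept)) for x, y in zip(p_normal, p_t2d)
)

# The color scheme described in the post: red above 50%, green otherwise.
colors = ["red" if p > 0.5 else "green" for p in p_t2d]

print(f"slope={slope:.6f} intercept={intercept:.6f} max_residual={max_residual:.2e}")
```

Whatever values are drawn, the fit comes out as slope -1 and intercept 1 up to floating-point rounding, with essentially zero residual. The two axes and the color carry only one number per person among them.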


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('parthasarathy', 0.366), ('probability', 0.286), ('graph', 0.268), ('tautology', 0.244), ('plotting', 0.185), ('minus', 0.185), ('colors', 0.173), ('axis', 0.165), ('genres', 0.129), ('diabetic', 0.129), ('raghuveer', 0.129), ('trivially', 0.122), ('blurry', 0.117), ('hint', 0.117), ('trivia', 0.117), ('mistakenly', 0.113), ('scatter', 0.113), ('joking', 0.109), ('variable', 0.105), ('horizontal', 0.104), ('vertical', 0.102), ('stress', 0.102), ('evident', 0.1), ('inefficient', 0.098), ('switched', 0.097), ('simply', 0.093), ('chess', 0.092), ('ages', 0.092), ('tufte', 0.091), ('slope', 0.091), ('indicator', 0.089), ('blogged', 0.086), ('dots', 0.085), ('arguably', 0.085), ('representation', 0.085), ('scheme', 0.085), ('labeling', 0.084), ('receiving', 0.082), ('assign', 0.082), ('ratio', 0.078), ('clear', 0.077), ('bad', 0.076), ('trend', 0.076), ('worst', 0.074), ('color', 0.074), ('criticizing', 0.073), ('discrete', 0.072), ('approximately', 0.071), ('straight', 0.071), ('explains', 0.07)]
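
This page does not say how the tfidf weights or the `simValue` scores below were computed. As a rough sketch only, assuming a plain term-frequency times log-inverse-document-frequency weighting and cosine similarity between the resulting vectors, the idea looks like this. The three toy documents and the function names `tfidf_vectors` and `cosine` are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Assumption: toy stand-ins for blog posts, not the real post texts.
docs = [
    "graph probability tautology graph axis",
    "graph colors fonts axis bars",
    "underscore dot variable names keyboard",
]
vecs = tfidf_vectors(docs)

self_sim = cosine(vecs[0], vecs[0])      # a post compared with itself
related_sim = cosine(vecs[0], vecs[1])   # shares "graph" and "axis"
unrelated_sim = cosine(vecs[0], vecs[2])  # shares no words

print(self_sim, related_sim, unrelated_sim)
```

Under this assumption, a post compared with itself scores 1 up to rounding, which would be consistent with the `same-blog` value of 1.0000002 in the list below, and posts that share distinctive words score higher than posts that share none.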

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 2132 andrew gelman stats-2013-12-13-And now, here’s something that would make Ed Tufte spin in his . . . ummm, Tufte’s still around, actually, so let’s just say I don’t think he’d like it!

2 0.15298072 502 andrew gelman stats-2011-01-04-Cash in, cash out graph

Introduction: David Afshartous writes: I thought this graph [from Ed Easterling] might be good for your blog. The 71 outlined squares show the main story, and the regions of the graph present the information nicely. Looks like the bins for the color coding are not of equal size and of course the end bins are unbounded. Might be interesting to graph the distribution of the actual data for the 71 outlined squares. In addition, I assume that each period begins on Jan 1 so data size could be naturally increased by looking at intervals that start on June 1 as well (where the limit of this process would be to have it at the granularity of one day; while it most likely wouldn’t make much difference, I’ve seen some graphs before where 1 year returns can be quite sensitive to starting date, etc). I agree that (a) the graph could be improved in small ways–in particular, adding half-year data seems like a great idea–and (b) it’s a wonderful, wonderful graph as is. And the NYT graphics people ad

3 0.15069599 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice

Introduction: Dean Eckles writes: Some of my coworkers at Facebook and I have worked with Udacity to create an online course on exploratory data analysis, including using data visualizations in R as part of EDA. The course has now launched at https://www.udacity.com/course/ud651 so anyone can take it for free. And Kaiser Fung has reviewed it. So definitely feel free to promote it! Criticism is also welcome (we are still fine-tuning things and adding more notes throughout). I wrote some more comments about the course here, including highlighting the interviews with my great coworkers. I didn’t have a chance to look at the course so instead I responded with some generic comments about eda and visualization (in no particular order): - Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what

4 0.14319927 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go

Introduction: I continue to struggle to convey my thoughts on statistical graphics so I’ll try another approach, this time giving my own story. For newcomers to this discussion: the background is that Antony Unwin and I wrote an article on the different goals embodied in information visualization and statistical graphics, but I have difficulty communicating on this point with the infovis people. Maybe if I tell my own story, and then they tell their stories, this will point a way forward to a more constructive discussion. So here goes. I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines. In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other

5 0.14082581 671 andrew gelman stats-2011-04-20-One more time-use graph

Introduction: Evan Hensleigh sends me this redesign of the cross-national time use graph: Here was my version: And here was the original: Compared to my graph, Evan’s has better fonts, and that’s important–good fonts can make a display look professional. But I’m not sure about his other innovations. To me, the different colors for the different time-use categories are more of a distraction than a visual aid, and I also don’t like how he made the bars fatter. As I noted in my earlier entry, to me this draws unwanted attention to the negative space between the bars. His country labels are slightly misaligned (particularly Japan and USA), and I really don’t like his horizontal axis at all! He removed the units of hours and put + and – on the edges so that the axes run into each other. What was the point of that? It’s bad news. Also I don’t see any advantage at all to the prehensile tick marks. On the other hand, if Evan and I were working together on such a graph, we w

6 0.1309059 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year

7 0.12752935 61 andrew gelman stats-2010-05-31-A data visualization manifesto

8 0.12559217 341 andrew gelman stats-2010-10-14-Confusion about continuous probability densities

9 0.11917542 488 andrew gelman stats-2010-12-27-Graph of the year

10 0.11510995 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies

11 0.11388939 1104 andrew gelman stats-2012-01-07-A compelling reason to go to London, Ontario??

12 0.11186113 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

13 0.10564585 1834 andrew gelman stats-2013-05-01-A graph at war with its caption. Also, how to visualize the same numbers without giving the display a misleading causal feel?

14 0.10381317 832 andrew gelman stats-2011-07-31-Even a good data display can sometimes be improved

15 0.10089644 1609 andrew gelman stats-2012-12-06-Stephen Kosslyn’s principles of graphics and one more: There’s no need to cram everything into a single plot

16 0.10060364 1894 andrew gelman stats-2013-06-12-How to best graph the Beveridge curve, relating the vacancy rate in jobs to the unemployment rate?

17 0.095073596 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly

18 0.094367892 670 andrew gelman stats-2011-04-20-Attractive but hard-to-read graph could be made much much better

19 0.093976445 23 andrew gelman stats-2010-05-09-Popper’s great, but don’t bother with his theory of probability

20 0.093753919 2091 andrew gelman stats-2013-11-06-“Marginally significant”


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.153), (1, -0.005), (2, 0.025), (3, 0.06), (4, 0.104), (5, -0.146), (6, -0.034), (7, 0.077), (8, -0.03), (9, -0.05), (10, 0.006), (11, -0.001), (12, -0.052), (13, -0.012), (14, -0.021), (15, 0.011), (16, 0.064), (17, 0.012), (18, -0.029), (19, -0.036), (20, 0.025), (21, 0.064), (22, -0.044), (23, -0.001), (24, 0.035), (25, 0.014), (26, 0.039), (27, 0.034), (28, -0.062), (29, -0.061), (30, 0.01), (31, -0.01), (32, -0.12), (33, -0.009), (34, -0.08), (35, -0.063), (36, -0.003), (37, -0.039), (38, -0.037), (39, 0.024), (40, -0.02), (41, -0.026), (42, 0.052), (43, -0.035), (44, -0.022), (45, -0.007), (46, 0.036), (47, 0.079), (48, -0.058), (49, 0.003)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97761917 2132 andrew gelman stats-2013-12-13-And now, here’s something that would make Ed Tufte spin in his . . . ummm, Tufte’s still around, actually, so let’s just say I don’t think he’d like it!

2 0.81397605 671 andrew gelman stats-2011-04-20-One more time-use graph

Introduction: Evan Hensleigh sends me this redesign of the cross-national time use graph: Here was my version: And here was the original: Compared to my graph, Evan’s has better fonts, and that’s important–good fonts can make a display look professional. But I’m not sure about his other innovations. To me, the different colors for the different time-use categories are more of a distraction than a visual aid, and I also don’t like how he made the bars fatter. As I noted in my earlier entry, to me this draws unwanted attention to the negative space between the bars. His country labels are slightly misaligned (particularly Japan and USA), and I really don’t like his horizontal axis at all! He removed the units of hours and put + and – on the edges so that the axes run into each other. What was the point of that? It’s bad news. Also I don’t see any advantage at all to the prehensile tick marks. On the other hand, if Evan and I were working together on such a graph, we w

3 0.80046386 1258 andrew gelman stats-2012-04-10-Why display 6 years instead of 30?

Introduction: I continue to be the go-to guy for bad graphs. Today (i.e., 22 Feb), I received an email from Gary Rosin: I [Rosin] thought you might be interested in this graph showing the decline in median prices of homes since 1997. It exaggerates the proportions by using $150,000 as the floor, rather than zero. Indeed. Here’s the graph: A line plot, rather than a bar plot, would be appropriate here. Also, it’s weird that the headline says “10 years” but the graph has only 6 years. Why not give some perspective and show, say, 30 years?

4 0.78711456 502 andrew gelman stats-2011-01-04-Cash in, cash out graph

Introduction: David Afshartous writes: I thought this graph [from Ed Easterling] might be good for your blog. The 71 outlined squares show the main story, and the regions of the graph present the information nicely. Looks like the bins for the color coding are not of equal size and of course the end bins are unbounded. Might be interesting to graph the distribution of the actual data for the 71 outlined squares. In addition, I assume that each period begins on Jan 1 so data size could be naturally increased by looking at intervals that start on June 1 as well (where the limit of this process would be to have it at the granularity of one day; while it most likely wouldn’t make much difference, I’ve seen some graphs before where 1 year returns can be quite sensitive to starting date, etc). I agree that (a) the graph could be improved in small ways–in particular, adding half-year data seems like a great idea–and (b) it’s a wonderful, wonderful graph as is. And the NYT graphics people ad

5 0.77046019 1011 andrew gelman stats-2011-11-15-World record running times vs. distance

Introduction: Julyan Arbel plots world record running times vs. distance (on the log-log scale): The line has a slope of 1.1. I think it would be clearer to plot speed vs. distance—then you’d get a slope of -0.1, and the numbers would be more directly interpretable. Indeed, this paper by Sandra Savaglio and Vincenzo Carbone (referred to in the comments on Julyan’s blog) plots speed vs. time. Graphing by speed gives more resolution: The upper-left graph in the grid corresponds to the human running records plotted by Arbel. It’s funny that Arbel sees only one line whereas Savaglio and Carbone see two—but if you remove the 100m record at one end and the 100km at the other end, you can see two lines in Arbel’s graph as well. The bottom two graphs show swimming records. Knut would probably have something to say about all this.

6 0.7633242 1104 andrew gelman stats-2012-01-07-A compelling reason to go to London, Ontario??

7 0.7632761 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year

8 0.7562663 915 andrew gelman stats-2011-09-17-(Worst) graph of the year

9 0.74380618 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly

10 0.73527163 488 andrew gelman stats-2010-12-27-Graph of the year

11 0.73001182 262 andrew gelman stats-2010-09-08-Here’s how rumors get started: Lineplots, dotplots, and nonfunctional modernist architecture

12 0.72736698 672 andrew gelman stats-2011-04-20-The R code for those time-use graphs

13 0.72407073 443 andrew gelman stats-2010-12-02-Automating my graphics advice

14 0.72311401 2146 andrew gelman stats-2013-12-24-NYT version of birthday graph

15 0.72194076 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals

16 0.71702331 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies

17 0.7137723 1613 andrew gelman stats-2012-12-09-Hey—here’s a photo of me making fun of a silly infographic (from last year)

18 0.70402598 1439 andrew gelman stats-2012-08-01-A book with a bunch of simple graphs

19 0.70044231 1609 andrew gelman stats-2012-12-06-Stephen Kosslyn’s principles of graphics and one more: There’s no need to cram everything into a single plot

20 0.69739026 1253 andrew gelman stats-2012-04-08-Technology speedup graph


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(5, 0.063), (12, 0.012), (16, 0.039), (23, 0.026), (24, 0.17), (27, 0.12), (36, 0.02), (51, 0.01), (54, 0.043), (65, 0.012), (78, 0.013), (83, 0.013), (87, 0.019), (95, 0.077), (96, 0.01), (97, 0.024), (99, 0.243)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9544183 2132 andrew gelman stats-2013-12-13-And now, here’s something that would make Ed Tufte spin in his . . . ummm, Tufte’s still around, actually, so let’s just say I don’t think he’d like it!

2 0.92185885 1472 andrew gelman stats-2012-08-28-Migrating from dot to underscore

Introduction: My C-oriented Stan collaborators have convinced me to use underscore (_) rather than dot (.) as much as possible in expressions in R. For example, I can name a variable n_years rather than n.years. This is fine. But I’m getting annoyed because I need to press the shift key every time I type the underscore. What do people do about this? I know that it’s easy enough to reassign keys (I could, for example, assign underscore to backslash, which I never use). I’m just wondering what C programmers actually do. Do they reassign the key or do they just get used to pressing Shift? P.S. In comments, Ben Hyde points to Google’s R style guide, which recommends that variable names use dots, not underscore or camel case, for variable names (for example, “avg.clicks” rather than “avg_Clicks” or “avgClicks”). I think they’re recommending this to be consistent with R coding conventions . I am switching to underscores in R variable names to be consistent with C. Otherwise we were run

3 0.91842961 134 andrew gelman stats-2010-07-08-“What do you think about curved lines connecting discrete data-points?”

Introduction: John Keltz writes: What do you think about curved lines connecting discrete data-points? (For example, here .) The problem with the smoothed graph is it seems to imply that something is going on in between the discrete data points, which is false. However, the straight-line version isn’t representing actual events either- it is just helping the eye connect each point. So maybe the curved version is also just helping the eye connect each point, and looks better doing it. In my own work (value-added modeling of achievement test scores) I use straight lines, but I guess I am not too bothered when people use smoothing. I’d appreciate your input. Regular readers will be unsurprised that, yes, I have an opinion on this one, and that this opinion is connected to some more general ideas about statistical graphics. In general I’m not a fan of the curved lines. They’re ok, but I don’t really see the point. I can connect the dots just fine without the curves. The more general id

4 0.91408265 343 andrew gelman stats-2010-10-15-?

Introduction: How am I supposed to handle this sort of thing? (See below.) I just stuck it one of my email folders without responding, but then I wondered . . . what’s it all about? Is there some sort of Glengarry Glen Ross-like parallel world where down-on-their-luck Jack Lemmons of public relations world send out electronic cold calls? More than anything else, this sort of thing makes me glad I have a steady job. Here’s the (unsolicited) email, which came with the subject line “Please help a reporter do his job”: Dear Andrew, As an Editor for the Bulldog Reporter (www.bulldogreporter.com/dailydog), a media relations trade publication, my job is to help ensure that my readers have accurate info about you and send you the best quality pitches. By taking five minutes or less to answer my questions (pasted below), you’ll receive targeted PR pitches from our client base that will match your beat and interests. Any help or direction is appreciated. Here are my questions. We have you listed

5 0.91079944 804 andrew gelman stats-2011-07-15-Static sensitivity analysis

Introduction: This is one of my favorite ideas. I used it in an application but have never formally studied it or written it up as a general method. Sensitivity analysis is when you check how inferences change when you fit several different models or when you vary inputs within a model. Sensitivity analysis is often recommended but is typically difficult to do, what with the hassle of carrying around all these different estimates. In Bayesian inference, sensitivity analysis is associated with varying the prior distribution, which irritates me: why not consider sensitivity to the likelihood, as that’s typically just as arbitrary as the prior while having a much larger effect on the inferences. So we came up with static sensitivity analysis, which is a way to assess sensitivity to assumptions while fitting only one model. The idea is that Bayesian posterior simulation gives you a range of parameter values, and from these you can learn about sensitivity directly. The published exampl

6 0.90794742 1238 andrew gelman stats-2012-03-31-Dispute about ethics of data sharing

7 0.90775663 465 andrew gelman stats-2010-12-13-$3M health care prediction challenge

8 0.90394652 708 andrew gelman stats-2011-05-12-Improvement of 5 MPG: how many more auto deaths?

9 0.90247077 930 andrew gelman stats-2011-09-28-Wiley Wegman chutzpah update: Now you too can buy a selection of garbled Wikipedia articles, for a mere $1400-$2800 per year!

10 0.9012289 1869 andrew gelman stats-2013-05-24-In which I side with Neyman over Fisher

11 0.9003067 802 andrew gelman stats-2011-07-13-Super Sam Fuld Needs Your Help (with Foul Ball stats)

12 0.89585817 341 andrew gelman stats-2010-10-14-Confusion about continuous probability densities

13 0.89418411 266 andrew gelman stats-2010-09-09-The future of R

14 0.88907373 1834 andrew gelman stats-2013-05-01-A graph at war with its caption. Also, how to visualize the same numbers without giving the display a misleading causal feel?

15 0.88878572 66 andrew gelman stats-2010-06-03-How can news reporters avoid making mistakes when reporting on technical issues? Or, Data used to justify “Data Used to Justify Health Savings Can Be Shaky” can be shaky

16 0.88821107 2319 andrew gelman stats-2014-05-05-Can we make better graphs of global temperature history?

17 0.88775444 2358 andrew gelman stats-2014-06-03-Did you buy laundry detergent on their most recent trip to the store? Also comments on scientific publication and yet another suggestion to do a study that allows within-person comparisons

18 0.8872937 652 andrew gelman stats-2011-04-07-Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics

19 0.88552517 1988 andrew gelman stats-2013-08-19-BDA3 still (I hope) at 40% off! (and a link to one of my favorite papers)

20 0.88312554 2135 andrew gelman stats-2013-12-15-The UN Plot to Force Bayesianism on Unsuspecting Americans (penalized B-Spline edition)