andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-61 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Details matter (at least, they do for me), but we don’t yet have a systematic way of going back and forth between the structure of a graph, its details, and the underlying questions that motivate our visualizations. (Cleveland, Wilkinson, and others have written a bit on how to formalize these connections, and I’ve thought about it too, but we have a ways to go.) I was thinking about this difficulty after reading an article on graphics by some computer scientists that was well-written but to me lacked a feeling for the linkages between substantive/statistical goals and graphical details. I have problems with these issues too, and my point here is not to criticize but to move the discussion forward. When thinking about visualization, how important are the details? Aleks pointed me to this article by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, “A Tour through the Visualization Zoo: A survey of powerful visualization techniques, from the obvious to the obscure.” Th
sentIndex sentText sentNum sentScore
1 ) I was thinking about this difficulty after reading an article on graphics by some computer scientists that was well-written but to me lacked a feeling for the linkages between substantive/statistical goals and graphical details. [sent-3, score-0.434]
2 ” They make some reasonable points, but a big problem I have with the article is in the details of the actual visualizations they show. [sent-7, score-0.428]
3 Figure 1B has that notorious alphabetical order, also some weird visual artifacts that get created by stacking curves, and an x-axis that is not fully labeled. [sent-9, score-0.451]
4 ) Yes, I realize that one purpose of the article is to criticize such graphs (“While such charts have proven popular in recent years, they do have some notable limitations. [sent-12, score-0.358]
5 Still, it doesn’t help to list the industries in alphabetical order. [sent-18, score-0.437]
6 Something went terribly wrong here; perhaps each graph was rescaled to its own range, which wouldn’t make much sense in a small multiples plot. [sent-21, score-0.35]
7 I could keep going here through all the other graphs in the article. But maybe these criticisms are irrelevant. [sent-25, score-0.43]
8 Perhaps such glitches (from my perspective) are either irrelevant to the general message of the graph or, from the other direction, force the reader to look at the graph and read the surrounding text more clearly to figure out what’s going on. [sent-31, score-0.645]
9 After all, a graph isn’t a TV show, readers aren’t passive, so maybe it’s actually good to make them work to figure out what’s going on. [sent-32, score-0.685]
10 At a statistical level, though, I think the details are very important, because they connect the data being graphed with the underlying questions being studied. [sent-33, score-0.448]
11 If you’re not interested in an alphabetical ordering, you don’t want to put it on a graph. [sent-35, score-0.302]
12 If you want to convey something beyond simply that big cars get worse gas mileage, you’ll want to invert the axes on your parallel coordinate plot. [sent-36, score-0.342]
13 If you wanted to say I’m wrong, you could perhaps invoke an opportunity cost argument, that the time I spend worrying about where to label the lines on a graph (not to mention the time I spend blogging about it! [sent-39, score-0.415]
14 For me, the details of the graphing are absolutely necessary to the statistical analysis–decades ago, before I did everything on the computer, I spent lots and lots of time making graphs by hand, using colored pens and all the rest–but for others, maybe not. [sent-41, score-0.688]
15 article is that it doesn’t mention what are perhaps the three most important kinds of graphs: dot plots, line plots, and scatterplots. [sent-43, score-0.587]
16 See here for a dotplot (from Jeff and Justin), and here for some line plots and scatterplots. [sent-44, score-0.298]
17 A clearer understanding of line plots would’ve been a big help in making Figure 1C, for example. [sent-48, score-0.435]
18 What’s missing is the link from the substantive questions (what are the reasons for making the graph in the first place? [sent-54, score-0.354]
19 Instead we go through menus of possibilities (actual forced options on computer packages, or mental menus in which we make choices based on what we’ve seen before) and then have to go back and fix things. [sent-57, score-0.424]
20 I didn’t feel like revising the whole piece, but I guess I will if I want to rewrite the article for publication somewhere, which maybe I’ll do if I find the right coauthor. [sent-70, score-0.295]
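Sentences 5, 6, and 11 in the table above object to alphabetical ordering and to rescaling each small-multiples panel to its own range. A minimal Python sketch of the alternative follows. The industry names and unemployment figures are made up for illustration (they are not taken from the Heer, Bostock, and Ogievetsky article), and the helper names `order_by_latest` and `shared_range` are hypothetical. It orders the series by a quantity of interest and computes one axis range shared by every panel.

```python
# Hypothetical data: industry -> unemployment rate (%) at three time points.
# Names and values are illustrative only.
series = {
    "Agriculture": [9.0, 10.5, 12.0],
    "Construction": [8.0, 14.0, 20.0],
    "Finance": [3.0, 4.5, 6.0],
    "Government": [2.0, 2.5, 3.0],
}

def order_by_latest(data):
    """Return series names sorted by their most recent value, largest first."""
    return sorted(data, key=lambda name: data[name][-1], reverse=True)

def shared_range(data):
    """Return one (low, high) range covering every series, for a common y-axis."""
    values = [v for ys in data.values() for v in ys]
    return min(values), max(values)

ordered = order_by_latest(series)
y_low, y_high = shared_range(series)
for name in ordered:
    print(f"{name:<12} latest={series[name][-1]:5.1f}  y-axis=({y_low}, {y_high})")
```

Sorting by the latest value is only one possible choice; the point is that the ordering should carry information the reader cares about, which alphabetical order rarely does.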
wordName wordTfidf (topN-words)
[('details', 0.245), ('alphabetical', 0.212), ('plots', 0.208), ('graph', 0.194), ('figure', 0.193), ('heer', 0.188), ('industries', 0.164), ('visualization', 0.163), ('graphs', 0.161), ('dot', 0.138), ('manifesto', 0.137), ('menus', 0.137), ('article', 0.122), ('stacking', 0.118), ('unemployment', 0.109), ('axes', 0.101), ('perhaps', 0.095), ('thinking', 0.094), ('ordering', 0.091), ('line', 0.09), ('want', 0.09), ('readers', 0.09), ('computer', 0.089), ('questions', 0.084), ('labels', 0.084), ('maybe', 0.083), ('scale', 0.08), ('important', 0.079), ('labeled', 0.076), ('making', 0.076), ('criticize', 0.075), ('forth', 0.074), ('systematic', 0.069), ('hand', 0.066), ('graphical', 0.066), ('going', 0.064), ('mention', 0.063), ('faulting', 0.063), ('invoke', 0.063), ('linkages', 0.063), ('pens', 0.063), ('fully', 0.062), ('make', 0.061), ('help', 0.061), ('something', 0.061), ('statistical', 0.06), ('factor', 0.059), ('bostock', 0.059), ('artifacts', 0.059), ('graphed', 0.059)]
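The page does not say how the word weights above were computed. As a rough sketch only, assuming the common definition of tf-idf (term frequency times the log of inverse document frequency) and naive whitespace tokenization, a top-N word list of this kind could be produced as follows. The function name `tfidf_top_words` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_top_words(docs, doc_index, top_n=5):
    """Rank the words of docs[doc_index] by tf-idf against the whole corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy corpus, for illustration only.
docs = [
    "details matter in a graph",
    "a graph of unemployment",
    "details of the model",
]
top = tfidf_top_words(docs, 0, top_n=3)
print(top)
```

Words that appear in every document get a weight of zero under this definition, which is why a real pipeline's numbers (possibly normalized or smoothed differently) will not match this sketch exactly.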
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 61 andrew gelman stats-2010-05-31-A data visualization manifesto
Introduction: Details matter (at least, they do for me), but we don’t yet have a systematic way of going back and forth between the structure of a graph, its details, and the underlying questions that motivate our visualizations. (Cleveland, Wilkinson, and others have written a bit on how to formalize these connections, and I’ve thought about it too, but we have a ways to go.) I was thinking about this difficulty after reading an article on graphics by some computer scientists that was well-written but to me lacked a feeling for the linkages between substantive/statistical goals and graphical details. I have problems with these issues too, and my point here is not to criticize but to move the discussion forward. When thinking about visualization, how important are the details? Aleks pointed me to this article by Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky, “A Tour through the Visualization Zoo: A survey of powerful visualization techniques, from the obvious to the obscure.” Th
Introduction: I continue to struggle to convey my thoughts on statistical graphics so I’ll try another approach, this time giving my own story. For newcomers to this discussion: the background is that Antony Unwin and I wrote an article on the different goals embodied in information visualization and statistical graphics, but I have difficulty communicating on this point with the infovis people. Maybe if I tell my own story, and then they tell their stories, this will point a way forward to a more constructive discussion. So here goes. I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines. In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other
3 0.24532597 676 andrew gelman stats-2011-04-23-The payoff: $650. The odds: 1 in 500,000.
Introduction: Details here.
4 0.22307111 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update
Introduction: To continue our discussion from last week, consider three positions regarding the display of information: (a) The traditional tabular approach. This is how most statisticians, econometricians, political scientists, sociologists, etc., seem to operate. They understand the appeal of a pretty graph, and they’re willing to plot some data as part of an exploratory data analysis, but they see their serious research as leading to numerical estimates, p-values, tables of numbers. These people might use a graph to illustrate their points but they don’t see them as necessary in their research. (b) Statistical graphics as performed by Howard Wainer, Bill Cleveland, Dianne Cook, etc. They–we–see graphics as central to the process of statistical modeling and data analysis and are interested in graphs (static and dynamic) that display every data point as transparently as possible. (c) Information visualization or infographics, as performed by graphics designers and statisticians who are
5 0.21962497 252 andrew gelman stats-2010-09-02-R needs a good function to make line plots
Introduction: More and more I’m thinking that line plots are great. More specifically, two-way grids of line plots on common scales, with one, two, or three lines per plot (enough to show comparisons but not so many that you can’t tell the lines apart). Also dot plots, of the sort that have been masterfully used by Lax and Phillips to show comparisons and trends in support for gay rights. There’s a big step missing, though, and that is to be able to make these graphs as a default. We have to figure out the right way to structure the data so these graphs come naturally. Then when it’s all working, we can talk the Excel people into implementing our ideas. I’m not asking to be paid here; all our ideas are in the public domain and I’m happy for Microsoft or Google or whoever to copy us. P.S. Drew Conway writes: This could be accomplished with ggplot2 using various combinations of the grammar. If I am understanding what you mean by line plots, here are some examples with code. In fact,
6 0.21773897 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
7 0.17753415 1848 andrew gelman stats-2013-05-09-A tale of two discussion papers
8 0.17701855 488 andrew gelman stats-2010-12-27-Graph of the year
9 0.16885544 1811 andrew gelman stats-2013-04-18-Psychology experiments to understand what’s going on with data graphics?
11 0.16245027 1764 andrew gelman stats-2013-03-15-How do I make my graphs?
13 0.15682431 1176 andrew gelman stats-2012-02-19-Standardized writing styles and standardized graphing styles
15 0.15003304 798 andrew gelman stats-2011-07-12-Sometimes a graph really is just ugly
16 0.14899975 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year
17 0.14670005 816 andrew gelman stats-2011-07-22-“Information visualization” vs. “Statistical graphics”
18 0.1462025 2172 andrew gelman stats-2014-01-14-Advice on writing research articles
19 0.14544921 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly
20 0.1444066 1767 andrew gelman stats-2013-03-17-The disappearing or non-disappearing middle class
topicId topicWeight
[(0, 0.301), (1, -0.087), (2, -0.029), (3, 0.105), (4, 0.186), (5, -0.231), (6, -0.108), (7, 0.059), (8, -0.026), (9, -0.004), (10, 0.028), (11, -0.025), (12, -0.044), (13, 0.015), (14, 0.023), (15, -0.013), (16, 0.021), (17, -0.019), (18, -0.061), (19, -0.001), (20, 0.007), (21, -0.01), (22, -0.001), (23, 0.053), (24, 0.008), (25, -0.016), (26, 0.036), (27, 0.006), (28, -0.018), (29, 0.011), (30, 0.028), (31, 0.003), (32, -0.025), (33, -0.014), (34, -0.031), (35, -0.013), (36, -0.018), (37, -0.01), (38, -0.002), (39, -0.04), (40, 0.009), (41, 0.013), (42, -0.004), (43, 0.022), (44, -0.029), (45, -0.029), (46, -0.022), (47, 0.001), (48, 0.016), (49, 0.013)]
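The similarity values in the list that follows are presumably computed from topic-weight vectors like the one above. The page does not state the measure, so the following is only a sketch under the assumption that it is cosine similarity. The function name `cosine_similarity` is illustrative, and the weights for `other_post` are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {topic_id: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention: an empty vector is similar to nothing.
    return dot / (norm_u * norm_v)

# First four topic weights from the list above, and a hypothetical second post.
this_post = {0: 0.301, 1: -0.087, 2: -0.029, 3: 0.105}
other_post = {0: 0.250, 1: -0.050, 3: 0.120}  # made-up weights

print(cosine_similarity(this_post, this_post))
print(cosine_similarity(this_post, other_post))
```

A post compared with itself scores 1 up to floating-point rounding, which would be consistent with the same-blog rows showing values close to, but not exactly, 1.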
simIndex simValue blogId blogTitle
same-blog 1 0.97313756 61 andrew gelman stats-2010-05-31-A data visualization manifesto
2 0.93193215 37 andrew gelman stats-2010-05-17-Is chartjunk really “more useful” than plain graphs? I don’t think so.
Introduction: Helen DeWitt links to this blog that reports on a study by Scott Bateman, Carl Gutwin, David McDine, Regan Mandryk, Aaron Genest, and Christopher Brooks that claims the following: Guidelines for designing information charts often state that the presentation should reduce ‘chart junk’–visual embellishments that are not essential to understanding the data. . . . we conducted an experiment that compared embellished charts with plain ones, and measured both interpretation accuracy and long-term recall. We found that people’s accuracy in describing the embellished charts was no worse than for plain charts, and that their recall after a two-to-three-week gap was significantly better. As the above-linked blogger puts it, “chartjunk is more useful than plain graphs. . . . Tufte is not going to like this.” I can’t speak for Ed Tufte, but I’m not gonna take this claim about chartjunk lying down. I have two points to make which I hope can stop the above-linked study from being sla
4 0.92620611 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly
Introduction: Denis Cote sends the following, under the heading, “Some bad graphs for your enjoyment”: To start with, they don’t know how to spell “color.” Seriously, though, the graph is a mess. The circular display implies a circular or periodic structure that isn’t actually in the data, the cramped display requires the use of an otherwise-unnecessary color code that makes it difficult to find or make sense of the information, the alphabetical ordering (without even supplying state names, only abbreviations) makes it further difficult to find any patterns. It would be so much better, and even easier, to just display a set of small maps shading states on whether they have different laws. But that’s part of the problem—the clearer graph would also be easier to make! To get a distinctive graph, there needs to be some degree of difficulty. The designers continue with these monstrosities: Here they decide to display only 5 states at a time so that it’s really hard to see any big pi
Introduction: Jerzy Wieczorek has an interesting review of the book Graph Design for the Eye and Mind by psychology researcher Stephen Kosslyn. I recommend you read all of Wieczorek’s review (and maybe Kosslyn’s book, but that I haven’t seen), but here I’ll just focus on one point. Here’s Wieczorek summarizing Kosslyn: p. 18-19: the horizontal axis should be for the variable with the “most important part of the data.” See Kosslyn’s Figure 1.6 and 1.7 below. Figure 1.6 clearly shows that one of the sex-by-income groups reacts to age differently than the other three groups do. Figure 1.7 uses sex as the x-axis variable, making it much harder to see this same effect in the data. As a statistician exploring the data, I might make several plots using different groupings… but for communicating my results to an audience, I would choose the one plot that shows the findings most clearly. Those who know me well (or who have read the title of this post) will guess my reaction, whic
6 0.90682185 1439 andrew gelman stats-2012-08-01-A book with a bunch of simple graphs
7 0.90172452 1764 andrew gelman stats-2013-03-15-How do I make my graphs?
8 0.89896691 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
9 0.89599693 2246 andrew gelman stats-2014-03-13-An Economist’s Guide to Visualizing Data
10 0.89598149 488 andrew gelman stats-2010-12-27-Graph of the year
11 0.89510036 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year
12 0.89147812 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals
14 0.88355082 1606 andrew gelman stats-2012-12-05-The Grinch Comes Back
15 0.87390047 319 andrew gelman stats-2010-10-04-“Who owns Congress”
16 0.85982931 1896 andrew gelman stats-2013-06-13-Against the myth of the heroic visualization
17 0.85792959 671 andrew gelman stats-2011-04-20-One more time-use graph
18 0.85529774 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update
20 0.84844774 296 andrew gelman stats-2010-09-26-A simple semigraphic display
topicId topicWeight
[(5, 0.046), (15, 0.041), (16, 0.078), (21, 0.026), (24, 0.17), (27, 0.014), (45, 0.02), (50, 0.084), (53, 0.024), (76, 0.017), (77, 0.017), (86, 0.032), (88, 0.01), (99, 0.307)]
simIndex simValue blogId blogTitle
Introduction: Jeff Ratto points me to this news article by Dean Baker reporting the work of three economists, Thomas Herndon, Michael Ash, and Robert Pollin, who found errors in a much-cited article by Carmen Reinhart and Kenneth Rogoff analyzing historical statistics of economic growth and public debt. Mike Konczal provides a clear summary; that’s where I got the above image. Errors in data processing and data analysis It turns out that Reinhart and Rogoff flubbed it. Herndon et al. write of “spreadsheet errors, omission of available data, weighting, and transcription.” The spreadsheet errors are the most embarrassing, but the other choices in data analysis seem pretty bad too. It can be tough to work with small datasets, so I have sympathy for Reinhart and Rogoff, but it does look like they were jumping to conclusions in their paper. Perhaps the urgency of the topic moved them to publish as fast as possible rather than carefully considering the impact of their data-analytic choi
same-blog 2 0.96856016 61 andrew gelman stats-2010-05-31-A data visualization manifesto
3 0.96535474 1793 andrew gelman stats-2013-04-08-The Supreme Court meets the fallacy of the one-sided bet
Introduction: Doug Hartmann writes (link from Jay Livingston): Justice Antonin Scalia’s comment in the Supreme Court hearings on the U.S. law defining marriage that “there’s considerable disagreement among sociologists as to what the consequences of raising a child in a single-sex family, whether that is harmful to the child or not.” Hartmann argues that Scalia is factually incorrect—there is not actually “considerable disagreement among sociologists” on this issue—and quotes a recent report from the American Sociological Association to this effect. Assuming there’s no other considerable group of sociologists (Hartmann knows of only one small group) arguing otherwise, it seems that Hartmann has a point. Scalia would’ve been better off omitting the phrase “among sociologists”—then he’d have been on safe ground, because you can always find somebody to take a position on the issue. Jerry Falwell’s no longer around but there’s a lot more where he came from. Even among scientists, there’s
Introduction: During our discussion of estimates of teacher performance, Steve Sailer wrote: I suspect we’re going to take years to work the kinks out of overall rating systems. By way of analogy, Bill James kicked off the modern era of baseball statistics analysis around 1975. But he stuck to doing smaller scale analyses and avoided trying to build one giant overall model for rating players. In contrast, other analysts such as Pete Palmer rushed into building overall ranking systems, such as his 1984 book, but they tended to generate curious results such as the greatness of Roy Smalley Jr. James held off until 1999 before unveiling his win share model for overall rankings. I remember looking at Pete Palmer’s book many years ago and being disappointed that he did everything through his Linear Weights formula. A hit is worth X, a walk is worth Y, etc. Some of this is good–it’s presumably an improvement on counting walks as 0 or 1 hits, also an improvement on counting doubles and triples a
5 0.9634881 1981 andrew gelman stats-2013-08-14-The robust beauty of improper linear models in decision making
Introduction: Andreas Graefe writes (see here, here, and here): The usual procedure for developing linear models to predict any kind of target variable is to identify a subset of most important predictors and to estimate weights that provide the best possible solution for a given sample. The resulting “optimally” weighted linear composite is then used when predicting new data. This approach is useful in situations with large and reliable datasets and few predictor variables. However, a large body of analytical and empirical evidence since the 1970s shows that the weighting of variables is of little, if any, value in situations with small and noisy datasets and a large number of predictor variables. In such situations, including all relevant variables is more important than their weighting. These findings have yet to impact many fields. This study uses data from nine established U.S. election-forecasting models whose forecasts are regularly published in academic journals to demonstrate the value o
6 0.96304893 210 andrew gelman stats-2010-08-16-What I learned from those tough 538 commenters
7 0.9618929 1713 andrew gelman stats-2013-02-08-P-values and statistical practice
9 0.95743465 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes
12 0.95506883 1760 andrew gelman stats-2013-03-12-Misunderstanding the p-value
13 0.95492554 2013 andrew gelman stats-2013-09-08-What we need here is some peer review for statistical graphics
14 0.95451415 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes
15 0.95429695 120 andrew gelman stats-2010-06-30-You can’t put Pandora back in the box
16 0.95429075 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards
17 0.95424384 391 andrew gelman stats-2010-11-03-Some thoughts on election forecasting
18 0.95423615 2080 andrew gelman stats-2013-10-28-Writing for free
19 0.95421946 1162 andrew gelman stats-2012-02-11-Adding an error model to a deterministic model