andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-502 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: David Afshartous writes: I thought this graph [from Ed Easterling] might be good for your blog. The 71 outlined squares show the main story, and the regions of the graph present the information nicely. Looks like the bins for the color coding are not of equal size and of course the end bins are unbounded. Might be interesting to graph the distribution of the actual data for the 71 outlined squares. In addition, I assume that each period begins on Jan 1 so data size could be naturally increased by looking at intervals that start on June 1 as well (where the limit of this process would be to have it at the granularity of one day; while it most likely wouldn’t make much difference, I’ve seen some graphs before where 1 year returns can be quite sensitive to starting date, etc). I agree that (a) the graph could be improved in small ways–in particular, adding half-year data seems like a great idea–and (b) it’s a wonderful, wonderful graph as is. And the NYT graphics people ad
sentIndex sentText sentNum sentScore
1 David Afshartous writes: I thought this graph [from Ed Easterling] might be good for your blog. [sent-1, score-0.292]
2 The 71 outlined squares show the main story, and the regions of the graph present the information nicely. [sent-2, score-0.816]
3 Looks like the bins for the color coding are not of equal size and of course the end bins are unbounded. [sent-3, score-1.095]
4 Might be interesting to graph the distribution of the actual data for the 71 outlined squares. [sent-4, score-0.597]
5 I agree that (a) the graph could be improved in small ways–in particular, adding half-year data seems like a great idea–and (b) it’s a wonderful, wonderful graph as is. [sent-6, score-1.109]
6 And the NYT graphics people added some nice touches such as the gray (rather than white) background and the thin white lines to separate the decades. [sent-7, score-1.016]
7 On a (slightly) more substantive note, I don’t think growth-adjusted-for-inflation is the best benchmark. [sent-8, score-0.095]
8 Instead of growth minus inflation, I’d like to see growth minus the default interest rate you could get from a savings account or T-bill or something like that. [sent-9, score-1.407]
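The benchmark comparison in sentence 8 can be sketched in a few lines. This is only an illustration, not code from the post: the function name `growth_minus` and all three rates are hypothetical, and simple subtraction is assumed because that is how the post phrases it.

```python
def growth_minus(growth, benchmark):
    """Annual growth rate minus a benchmark rate (simple subtraction)."""
    return growth - benchmark

# Hypothetical annual rates, chosen only for illustration.
growth = 0.07      # nominal market growth
inflation = 0.03   # benchmark used in the original graph
tbill = 0.04       # alternative benchmark: a T-bill or savings rate

vs_inflation = growth_minus(growth, inflation)
vs_tbill = growth_minus(growth, tbill)
print(f"growth minus inflation: {vs_inflation:.2%}")
print(f"growth minus T-bill:    {vs_tbill:.2%}")
```

Whenever the default interest rate exceeds inflation, the second figure is the smaller one, which is why the choice of benchmark changes how the same period looks.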
wordName wordTfidf (topN-words)
[('outlined', 0.305), ('bins', 0.294), ('graph', 0.292), ('minus', 0.242), ('growth', 0.2), ('wonderful', 0.197), ('white', 0.166), ('granularity', 0.152), ('touches', 0.147), ('size', 0.137), ('savings', 0.131), ('jan', 0.128), ('thin', 0.126), ('inflation', 0.119), ('june', 0.114), ('regions', 0.114), ('returns', 0.112), ('gray', 0.112), ('possibilities', 0.11), ('coding', 0.109), ('squares', 0.105), ('ed', 0.104), ('sensitive', 0.104), ('date', 0.104), ('nyt', 0.102), ('naturally', 0.099), ('limit', 0.097), ('color', 0.096), ('substantive', 0.095), ('improved', 0.094), ('begins', 0.093), ('equal', 0.092), ('intervals', 0.09), ('default', 0.088), ('adding', 0.087), ('increased', 0.087), ('period', 0.085), ('account', 0.084), ('separate', 0.081), ('slightly', 0.081), ('starting', 0.079), ('nice', 0.079), ('addition', 0.078), ('graphics', 0.078), ('added', 0.077), ('lines', 0.077), ('could', 0.074), ('like', 0.073), ('background', 0.073), ('etc', 0.072)]
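The page does not say how the wordTfidf weights above were computed. As a rough sketch, assuming a standard tf-idf with a smoothed idf term and L2 normalization (the tokenizer, the exact idf formula, and the names `tfidf` and `top_words` are all assumptions), weights of this kind could be produced like this:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document, L2-normalized.

    Assumed weighting: tf * (log(n / df) + 1). The source page does not
    document the formula it actually used.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {w: c * (math.log(n / df[w]) + 1.0) for w, c in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({w: v / norm for w, v in raw.items()})
    return vectors

def top_words(vector, k):
    """The k highest-weighted (word, weight) pairs, largest first."""
    return sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy corpus, for illustration only.
docs = ["graph bins graph", "graph color"]
vectors = tfidf(docs)
print(top_words(vectors[0], 2))
```

Words that appear in few documents get a larger idf factor, so a rarer word can outrank a more common one at the same term frequency.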
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 502 andrew gelman stats-2011-01-04-Cash in, cash out graph
2 0.18222094 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
Introduction: Dean Eckles writes: Some of my coworkers at Facebook and I have worked with Udacity to create an online course on exploratory data analysis, including using data visualizations in R as part of EDA. The course has now launched at https://www.udacity.com/course/ud651 so anyone can take it for free. And Kaiser Fung has reviewed it. So definitely feel free to promote it! Criticism is also welcome (we are still fine-tuning things and adding more notes throughout). I wrote some more comments about the course here, including highlighting the interviews with my great coworkers. I didn’t have a chance to look at the course so instead I responded with some generic comments about EDA and visualization (in no particular order): - Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what
Introduction: I continue to struggle to convey my thoughts on statistical graphics so I’ll try another approach, this time giving my own story. For newcomers to this discussion: the background is that Antony Unwin and I wrote an article on the different goals embodied in information visualization and statistical graphics, but I have difficulty communicating on this point with the infovis people. Maybe if I tell my own story, and then they tell their stories, this will point a way forward to a more constructive discussion. So here goes. I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines. In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other
4 0.15648691 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies
Introduction: From Chris Mulligan: The data come from the Center for Disease Control and cover the years 1969-1988. Chris also gives instructions for how to download the data and plot them in R from scratch (in 30 lines of R code)! And now, the background A few months ago I heard about a study reporting that, during a recent eleven-year period, more babies were born on Valentine’s Day and fewer on Halloween compared to neighboring days: I wrote , What I’d really like to see is a graph with all 366 days of the year. It would be easy enough to make. That way we could put the Valentine’s and Halloween data in the context of other possible patterns. While they’re at it, they could also graph births by day of the week and show Thanksgiving, Easter, and other holidays that don’t have fixed dates. It’s so frustrating when people only show part of the story. I was pointed to some tables: and a graph from Matt Stiles: The heatmap is cute but I wanted to se
Introduction: We haven’t had one of these in awhile, having mostly switched to the “chess trivia” and “bad p-values” genres of blogging . . . But I had to come back to the topic after receiving this note from Raghuveer Parthasarathy: Here’s another bad graph you might like. It might (arguably) be even worse than the “worst graphs of the year” you’ve blogged about, since rather than being a poor representation of data, it is simply the plotting of a tautology that mistakenly gives the impression of being data. (And it’s in Nature.) Parthasarathy explains: On the vertical axis we have the probability of being Type 2 Diabetic (T2D). On the horizontal axis we have the probability of being normal. There’s a clear, important trend evident, right? No! The probability of being normal is trivially one minus the probability of being T2D! The graph could not possibly be anything other than a straight line of slope -1. (For the students out there: the complete lack of scatter in the graph is
6 0.13193797 61 andrew gelman stats-2010-05-31-A data visualization manifesto
7 0.12793663 2186 andrew gelman stats-2014-01-26-Infoviz on top of stat graphic on top of spreadsheet
8 0.12439603 1584 andrew gelman stats-2012-11-19-Tradeoffs in information graphics
10 0.11871073 855 andrew gelman stats-2011-08-16-Infovis and statgraphics update update
11 0.11519343 2308 andrew gelman stats-2014-04-27-White stripes and dead armadillos
12 0.11498006 1764 andrew gelman stats-2013-03-15-How do I make my graphs?
13 0.11100189 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year
16 0.10407463 2146 andrew gelman stats-2013-12-24-NYT version of birthday graph
19 0.096252277 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals
20 0.095883168 583 andrew gelman stats-2011-02-21-An interesting assignment for statistical graphics
topicId topicWeight
[(0, 0.171), (1, -0.03), (2, 0.016), (3, 0.072), (4, 0.142), (5, -0.154), (6, -0.083), (7, 0.075), (8, -0.052), (9, -0.019), (10, -0.009), (11, -0.005), (12, -0.023), (13, -0.002), (14, 0.016), (15, -0.005), (16, 0.042), (17, -0.005), (18, 0.006), (19, -0.014), (20, 0.023), (21, 0.046), (22, -0.023), (23, 0.006), (24, 0.023), (25, -0.027), (26, 0.017), (27, -0.009), (28, -0.02), (29, 0.001), (30, 0.046), (31, -0.013), (32, -0.085), (33, -0.034), (34, -0.022), (35, -0.013), (36, -0.039), (37, -0.044), (38, -0.013), (39, 0.03), (40, -0.007), (41, -0.011), (42, 0.036), (43, 0.003), (44, -0.008), (45, 0.026), (46, 0.062), (47, -0.004), (48, -0.034), (49, -0.022)]
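The simValue columns on this page presumably compare topic vectors like the one above, but the page does not state which measure was used. A minimal sketch, assuming cosine similarity over sparse (topicId, topicWeight) pairs (the function name and the toy vectors are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    da, db = dict(a), dict(b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * db.get(i, 0.0) for i, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy vectors in the same (topicId, topicWeight) layout, for illustration.
post_a = [(0, 0.171), (1, -0.03), (4, 0.142)]
post_b = [(0, 0.150), (4, 0.120), (5, -0.100)]
print(cosine_similarity(post_a, post_b))
```

Because the measure depends only on direction, two posts with proportional topic weights score close to 1 regardless of how large the weights are.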
simIndex simValue blogId blogTitle
same-blog 1 0.9704563 502 andrew gelman stats-2011-01-04-Cash in, cash out graph
2 0.91867286 829 andrew gelman stats-2011-07-29-Infovis vs. statgraphics: A clear example of their different goals
Introduction: I recently came across a data visualization that perfectly demonstrates the difference between the “infovis” and “statgraphics” perspectives. Here’s the image ( link from Tyler Cowen): That’s the infovis. The statgraphic version would simply be a dotplot, something like this: (I purposely used the default settings in R with only minor modifications here to demonstrate what happens if you just want to plot the data with minimal effort.) Let’s compare the two graphs: From a statistical graphics perspective, the second graph dominates. The countries are directly comparable and the numbers are indicated by positions rather than area. The first graph is full of distracting color and gives the misleading visual impression that the total GDP of countries 5-10 is about equal to that of countries 1-4. If the goal is to get attention , though, it’s another story. There’s nothing special about the top graph above except how it looks. It represents neither a dat
3 0.90196604 1684 andrew gelman stats-2013-01-20-Ugly ugly ugly
Introduction: Denis Cote sends the following , under the heading, “Some bad graphs for your enjoyment”: To start with, they don’t know how to spell “color.” Seriously, though, the graph is a mess. The circular display implies a circular or periodic structure that isn’t actually in the data, the cramped display requires the use of an otherwise-unnecessary color code that makes it difficult to find or make sense of the information, the alphabetical ordering (without even supplying state names, only abbreviations) makes it further difficult to find any patterns. It would be so much better, and even easier, to just display a set of small maps shading states on whether they have different laws. But that’s part of the problem—the clearer graph would also be easier to make! To get a distinctive graph, there needs to be some degree of difficulty. The designers continue with these monstrosities: Here they decide to display only 5 states at a time so that it’s really hard to see any big pi
4 0.90061551 671 andrew gelman stats-2011-04-20-One more time-use graph
Introduction: Evan Hensleigh sends me this redesign of the cross-national time use graph: Here was my version: And here was the original: Compared to my graph, Evan’s has better fonts, and that’s important–good fonts can make a display look professional. But I’m not sure about his other innovations. To me, the different colors for the different time-use categories are more of a distraction than a visual aid, and I also don’t like how he made the bars fatter. As I noted in my earlier entry, to me this draws unwanted attention to the negative space between the bars. His country labels are slightly misaligned (particularly Japan and USA), and I really don’t like his horizontal axis at all! He removed the units of hours and put + and – on the edges so that the axes run into each other. What was the point of that? It’s bad news. Also I don’t see any advantage at all to the prehensile tick marks. On the other hand, if Evan and I were working together on such a graph, we w
5 0.89111304 1376 andrew gelman stats-2012-06-12-Simple graph WIN: the example of birthday frequencies
6 0.88760459 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year
7 0.88504303 488 andrew gelman stats-2010-12-27-Graph of the year
8 0.86110032 915 andrew gelman stats-2011-09-17-(Worst) graph of the year
9 0.85869914 2146 andrew gelman stats-2013-12-24-NYT version of birthday graph
11 0.84748298 1253 andrew gelman stats-2012-04-08-Technology speedup graph
12 0.84560835 2266 andrew gelman stats-2014-03-25-A statistical graphics course and statistical graphics advice
13 0.84234518 1258 andrew gelman stats-2012-04-10-Why display 6 years instead of 30?
14 0.84048724 1011 andrew gelman stats-2011-11-15-World record running times vs. distance
16 0.83672458 672 andrew gelman stats-2011-04-20-The R code for those time-use graphs
18 0.83448511 1439 andrew gelman stats-2012-08-01-A book with a bunch of simple graphs
19 0.8305698 1613 andrew gelman stats-2012-12-09-Hey—here’s a photo of me making fun of a silly infographic (from last year)
20 0.82877415 1669 andrew gelman stats-2013-01-12-The power of the puzzlegraph
topicId topicWeight
[(2, 0.022), (16, 0.074), (20, 0.011), (21, 0.077), (23, 0.027), (24, 0.2), (34, 0.022), (36, 0.051), (54, 0.072), (55, 0.01), (60, 0.014), (63, 0.019), (71, 0.012), (76, 0.017), (79, 0.015), (86, 0.011), (95, 0.028), (99, 0.233)]
simIndex simValue blogId blogTitle
same-blog 1 0.96196014 502 andrew gelman stats-2011-01-04-Cash in, cash out graph
2 0.94413286 2029 andrew gelman stats-2013-09-18-Understanding posterior p-values
Introduction: David Kaplan writes: I came across your paper “Understanding Posterior Predictive P-values”, and I have a question regarding your statement “If a posterior predictive p-value is 0.4, say, that means that, if we believe the model, we think there is a 40% chance that tomorrow’s value of T(y_rep) will exceed today’s T(y).” This is perfectly understandable to me and represents the idea of calibration. However, I am unsure how this relates to statements about fit. If T is the LR chi-square or Pearson chi-square, then your statement that there is a 40% chance that tomorrows value exceeds today’s value indicates bad fit, I think. Yet, some literature indicates that high p-values suggest good fit. Could you clarify this? My reply: I think that “fit” depends on the question being asked. In this case, I’d say the model fits for this particular purpose, even though it might not fit for other purposes. And here’s the abstract of the paper: Posterior predictive p-values do not i
3 0.93909097 1080 andrew gelman stats-2011-12-24-Latest in blog advertising
Introduction: I received the following message from “Patricia Lopez” of “Premium Link Ads”: Hello, I am interested in placing a text link on your page: http://andrewgelman.com/2011/07/super_sam_fuld/. The link would point to a page on a website that is relevant to your page and may be useful to your site visitors. We would be happy to compensate you for your time if it is something we are able to work out. The best way to reach me is through a direct response to this email. This will help me get back to you about the right link request. Please let me know if you are interested, and if not thanks for your time. Thanks. Usually I just ignore these, but after our recent discussion I decided to reply. I wrote: How much do you pay? But no answer. I wonder what’s going on? I mean, why bother sending the email in the first place if you’re not going to follow up?
4 0.93385327 574 andrew gelman stats-2011-02-14-“The best data visualizations should stand on their own”? I don’t think so.
Introduction: Jimmy pointed me to this blog by Drew Conway on word clouds. I don’t have much to say about Conway’s specifics–word clouds aren’t really my thing, but I’m glad that people are thinking about how to do them better–but I did notice one phrase of his that I’ll dispute. Conway writes The best data visualizations should stand on their own . . . I disagree. I prefer the saying, “A picture plus 1000 words is better than two pictures or 2000 words.” That is, I see a positive interaction between words and pictures or, to put it another way, diminishing returns for words or pictures on their own. I don’t have any big theory for this, but I think, when expressed as a joint value function, my idea makes sense. Also, I live this suggestion in my own work. I typically accompany my graphs with long captions and I try to accompany my words with pictures (although I’m not doing it here, because with the software I use, it’s much easier to type more words than to find, scale, and insert i
5 0.93307573 810 andrew gelman stats-2011-07-20-Adding more information can make the variance go up (depending on your model)
Introduction: Andy McKenzie writes: In their March 9 “ counterpoint ” in nature biotech to the prospect that we should try to integrate more sources of data in clinical practice (see “ point ” arguing for this), Isaac Kohane and David Margulies claim that, “Finally, how much better is our new knowledge than older knowledge? When is the incremental benefit of a genomic variant(s) or gene expression profile relative to a family history or classic histopathology insufficient and when does it add rather than subtract variance?” Perhaps I am mistaken (thus this email), but it seems that this claim runs contra to the definition of conditional probability. That is, if you have a hierarchical model, and the family history / classical histopathology already suggests a parameter estimate with some variance, how could the new genomic info possibly increase the variance of that parameter estimate? Surely the question is how much variance the new genomic info reduces and whether it therefore justifies t
7 0.93106848 1757 andrew gelman stats-2013-03-11-My problem with the Lindley paradox
8 0.93029606 896 andrew gelman stats-2011-09-09-My homework success
10 0.92928809 1881 andrew gelman stats-2013-06-03-Boot
11 0.92915118 1792 andrew gelman stats-2013-04-07-X on JLP
12 0.92853993 1607 andrew gelman stats-2012-12-05-The p-value is not . . .
13 0.9280414 898 andrew gelman stats-2011-09-10-Fourteen magic words: an update
14 0.92706347 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?
15 0.92695653 1584 andrew gelman stats-2012-11-19-Tradeoffs in information graphics
16 0.92635345 2312 andrew gelman stats-2014-04-29-Ken Rice presents a unifying approach to statistical inference and hypothesis testing
17 0.92615229 1240 andrew gelman stats-2012-04-02-Blogads update
20 0.92371011 1155 andrew gelman stats-2012-02-05-What is a prior distribution?