andrew_gelman_stats-2010-305 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Dan Goldstein sends along this bit of research, distinguishing terms used in two different subfields of psychology. Dan writes: Intuitive calls included not listing words that don’t occur 3 or more times in both programs. I [Dan] did this because when I looked at the results, those cases tended to be proper names or arbitrary things like header or footer text. It also narrowed down the space of words to inspect, which means I could actually get the thing done in my copious free time. I think the bar graphs are kinda ugly; maybe there’s a better way to do it based on classifying the words according to content? Also, the whole exercise would gain a new dimension by comparing several areas instead of just two. Maybe that’s coming next.
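A minimal sketch of the kind of comparison Dan describes, assuming the inputs are two collections of program texts (the function and variable names below are hypothetical, not Dan's actual code): count word frequencies in each subfield, keep only words that occur 3 or more times in both, and rank the survivors by how much their relative frequency differs between the two fields.

from collections import Counter
import re

def word_counts(texts):
    # Lowercase each document, split it into word tokens, and tally them.
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def compare_fields(texts_a, texts_b, min_count=3):
    # Keep only words appearing min_count or more times in BOTH corpora,
    # mirroring Dan's filter for proper names and header/footer junk.
    a, b = word_counts(texts_a), word_counts(texts_b)
    total_a, total_b = sum(a.values()), sum(b.values())
    shared = [w for w in a if a[w] >= min_count and b[w] >= min_count]
    # Rank shared words by how much more often they occur in corpus A than B.
    ratios = {w: (a[w] / total_a) / (b[w] / total_b) for w in shared}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage, with each argument a list of strings:
# compare_fields(decision_science_texts, social_psych_texts)

Extending the exercise to several areas, as Dan suggests, would amount to running the same pairwise comparison (or a shared frequency filter) across all the corpora.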
simIndex simValue blogId blogTitle
same-blog 1 1.0 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology
2 0.22892779 190 andrew gelman stats-2010-08-07-Mister P makes the big jump from the New York Times to the Washington Post
Introduction: See paragraphs 13-15 of this article by Dan Balz.
3 0.16424505 1104 andrew gelman stats-2012-01-07-A compelling reason to go to London, Ontario??
Introduction: Dan Goldstein asks what I think of this: My reply: It’s hard for me to imagine a compelling reason for anyone to go to London, Ontario–but, hey, I guess there’s all kinds of people in this world! More seriously, I see the appeal of the graph but it’s a bit busy for my taste. Over the years I’ve moved toward small multiples rather than single busy graphs. That’s one reason why I prefer Tufte’s second book to his first book. The Napoleon-in-Russia graph is a bad model, in that it inspires people to try to cram lots of variables on a single graph. Dan wrote back: I [Dan] like it as a travel-planning graph: it gives you what you want to know (how hot will the days be, how cold will the nights be, will it rain) but is a bit easier on the brain than a table of highs and lows. It also makes it easy to see the trend. I agree the 2nd axis doesn’t help.
4 0.16216826 29 andrew gelman stats-2010-05-12-Probability of successive wins in baseball
Introduction: Dan Goldstein did an informal study asking people the following question: When two baseball teams play each other on two consecutive days, what is the probability that the winner of the first game will be the winner of the second game? You can make your own guess and then continue reading below. Dan writes: We asked two colleagues knowledgeable in baseball and the mathematics of forecasting. The answers came in between 65% and 70%. The true answer [based on Dan's analysis of a database of baseball games]: 51.3%, a little better than a coin toss. I have to say, I’m surprised his colleagues gave such extreme guesses. I was guessing something like 50%, myself, based on the following very crude reasoning: Suppose two unequal teams are playing, and the chance of team A beating team B is 55%. (This seems like a reasonable average of all matchups, which will include some more extreme disparities but also many more equal contests.) Then the chance of the same team
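The excerpt is cut off, but the crude calculation it sets up is easy to finish under one assumption (my reconstruction, treating the two games as independent with the same win probability p for the stronger team): the winner of the first game also wins the second exactly when one team wins both games.

# Crude check of this reasoning, assuming independent games with a fixed
# win probability p for the better team (the excerpt stops before this step).
p = 0.55
p_same_winner = p**2 + (1 - p)**2   # P(one team wins both games)
print(p_same_winner)                # 0.505, close to the empirical 51.3%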
Introduction: 1. I remarked that Sharad had a good research article with some ugly graphs. 2. Dan posted Sharad’s graph and some unpleasant alternatives, inadvertently associating me with one of the unpleasant alternatives. Dan was comparing barplots with dotplots. 3. I commented on Dan’s site that, in this case, I’d much prefer a well-designed lineplot. I wrote: There’s a principle in decision analysis that the most important step is not the evaluation of the decision tree but the decision of what options to include in the tree in the first place. I think that’s what’s happening here. You’re seriously limiting yourself by considering the above options, which really are all the same graph with just slight differences in format. What you need to do is break outside the box. (Graph 2-which I think you think is the kind of thing that Gelman would like-indeed is the kind of thing that I think the R gurus like, but I don’t like it at all . It looks clean without actually being clea
6 0.11945999 126 andrew gelman stats-2010-07-03-Graphical presentation of risk ratios
7 0.11714876 509 andrew gelman stats-2011-01-09-Chartjunk, but in a good cause!
8 0.11473257 2022 andrew gelman stats-2013-09-13-You heard it here first: Intense exercise can suppress appetite
9 0.11148158 77 andrew gelman stats-2010-06-09-Sof[t]
10 0.10954157 687 andrew gelman stats-2011-04-29-Zero is zero
11 0.10719755 455 andrew gelman stats-2010-12-07-Some ideas on communicating risks to the general public
12 0.0992397 574 andrew gelman stats-2011-02-14-“The best data visualizations should stand on their own”? I don’t think so.
14 0.09089613 2211 andrew gelman stats-2014-02-14-The popularity of certain baby names is falling off the clifffffffffffff
15 0.089258239 863 andrew gelman stats-2011-08-21-Bad graph
16 0.084635392 1919 andrew gelman stats-2013-06-29-R sucks
17 0.084385037 1364 andrew gelman stats-2012-06-04-Massive confusion about a study that purports to show that exercise may increase heart risk
18 0.08263234 207 andrew gelman stats-2010-08-14-Pourquoi Google search est devenu plus raisonnable?
19 0.081283286 1932 andrew gelman stats-2013-07-10-Don’t trust the Turk
20 0.078453213 1090 andrew gelman stats-2011-12-28-“. . . extending for dozens of pages”
simIndex simValue blogId blogTitle
same-blog 1 0.94149828 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology
Introduction: (same as above)
Introduction: We haven’t had one of these in a while, having mostly switched to the “chess trivia” and “bad p-values” genres of blogging . . . But I had to come back to the topic after receiving this note from Raghuveer Parthasarathy: Here’s another bad graph you might like. It might (arguably) be even worse than the “worst graphs of the year” you’ve blogged about, since rather than being a poor representation of data, it is simply the plotting of a tautology that mistakenly gives the impression of being data. (And it’s in Nature.) Parthasarathy explains: On the vertical axis we have the probability of being Type 2 Diabetic (T2D). On the horizontal axis we have the probability of being normal. There’s a clear, important trend evident, right? No! The probability of being normal is trivially one minus the probability of being T2D! The graph could not possibly be anything other than a straight line of slope -1. (For the students out there: the complete lack of scatter in the graph is
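To make the arithmetic behind Parthasarathy's complaint explicit: the two plotted quantities are complementary probabilities, so every point is forced onto one line no matter what the data look like (the numbers below are made up for illustration).

# Each group satisfies P(normal) + P(T2D) = 1, so plotting P(T2D) against
# P(normal) can only ever produce the line y = 1 - x: slope -1, intercept 1,
# and zero scatter by construction.
p_t2d = [0.05, 0.10, 0.20, 0.40]      # hypothetical group-level P(T2D) values
p_normal = [1 - p for p in p_t2d]     # P(normal) is determined: 1 - P(T2D)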
3 0.67273438 832 andrew gelman stats-2011-07-31-Even a good data display can sometimes be improved
Introduction: When I first saw this graphic, I thought “boy, that’s great, sometimes the graphic practically makes itself.” Normally it’s hard to use lots of different colors to differentiate items of interest, because there’s usually not an intuitive mapping between color and item (e.g. for countries, or states, or whatever). But the colors of crayons, what could be more perfect? So this graphic seemed awesome. But, as they discovered after some experimentation at datapointed.net there is an even BETTER possibility here. Click the link to see. Crayola Crayon colors by year
4 0.66721314 1747 andrew gelman stats-2013-03-03-More research on the role of puzzles in processing data graphics
Introduction: Ruth Rosenholtz of the department of Brain and Cognitive Science at MIT writes: We mostly do computational modeling of human vision. We try to do on the one hand the sort of basic science that fits in the human vision community, while on the other hand developing predictive models which might actually lend insight into design. Your talk resonated with me in part because of this paper [Do Predictions of Visual Perception Aid Design?, by Ruth Rosenholtz, Amal Dorai, and Rosalind Freeman]. We went into our study thinking that people would like to have a quantitative tool to help analyze designs. But what we concluded, somewhat anecdotally, was that its main use seemed to be as a conversation-starter, and a means of communicating ideas about the design. And the reason it seemed to work is that our visualizations were the right level of a “puzzle” — challenging enough to be a bit fun to work out. On another topic, check out the infographic from last weekend’s NYTimes ma
5 0.66547275 1104 andrew gelman stats-2012-01-07-A compelling reason to go to London, Ontario??
6 0.65208161 126 andrew gelman stats-2010-07-03-Graphical presentation of risk ratios
7 0.65084255 787 andrew gelman stats-2011-07-05-Different goals, different looks: Infovis and the Chris Rock effect
8 0.64997137 672 andrew gelman stats-2011-04-20-The R code for those time-use graphs
9 0.64921224 687 andrew gelman stats-2011-04-29-Zero is zero
10 0.63727075 671 andrew gelman stats-2011-04-20-One more time-use graph
11 0.62974966 1154 andrew gelman stats-2012-02-04-“Turn a Boring Bar Graph into a 3D Masterpiece”
12 0.62153679 1439 andrew gelman stats-2012-08-01-A book with a bunch of simple graphs
14 0.61396039 2154 andrew gelman stats-2013-12-30-Bill Gates’s favorite graph of the year
16 0.60839921 1800 andrew gelman stats-2013-04-12-Too tired to mock
17 0.60697234 296 andrew gelman stats-2010-09-26-A simple semigraphic display
18 0.6054917 37 andrew gelman stats-2010-05-17-Is chartjunk really “more useful” than plain graphs? I don’t think so.
20 0.60054332 1862 andrew gelman stats-2013-05-18-uuuuuuuuuuuuugly
simIndex simValue blogId blogTitle
1 0.97857487 436 andrew gelman stats-2010-11-29-Quality control problems at the New York Times
Introduction: I guess there’s a reason they put this stuff in the Opinion section and not in the Science section, huh? P.S. More here.
2 0.96411252 1530 andrew gelman stats-2012-10-11-Migrating your blog from Movable Type to WordPress
Introduction: Cord Blomquist, who did a great job moving us from horrible Movable Type to nice nice WordPress, writes: I [Cord] wanted to share a little news with you related to the original work we did for you last year. When ReadyMadeWeb converted your Movable Type blog to WordPress, we got a lot of other requests for the same service, so we started thinking about a bigger market for such a product. After a bit of research, we started work on automating the data conversion, writing rules, and exceptions to the rules, on how Movable Type and TypePad data could be translated to WordPress. After many months of work, we’re getting ready to announce TP2WP.com, a service that converts Movable Type and TypePad export files to WordPress import files, so anyone who wants to migrate to WordPress can do so easily and without losing permalinks, comments, images, or other files. By automating our service, we’ve been able to drop the price to just $99. I recommend it (and, no, Cord is not paying m
3 0.95641637 1427 andrew gelman stats-2012-07-24-More from the sister blog
Introduction: Anthropologist Bruce Mannheim reports that a recent well-publicized study on the genetics of native Americans, which used genetic analysis to find “at least three streams of Asian gene flow,” is in fact a confirmation of a long-known fact. Mannheim writes: This three-way distinction was known linguistically since the 1920s (for example, Sapir 1921). Basically, it’s a division among the Eskimo-Aleut languages, which straddle the Bering Straits even today, the Athabaskan languages (which were discovered to be related to a small Siberian language family only within the last few years, not by Greenberg as Wade suggested), and everything else. This is not to say that the results from genetics are unimportant, but it’s good to see how it fits with other aspects of our understanding.
4 0.95026708 253 andrew gelman stats-2010-09-03-Gladwell vs Pinker
Introduction: I just happened to notice this from last year. Eric Loken writes: Steven Pinker reviewed Malcolm Gladwell’s latest book and criticized him rather harshly for several shortcomings. Gladwell appears to have made things worse for himself in a letter to the editor of the NYT by defending a manifestly weak claim from one of his essays – the claim that NFL quarterback performance is unrelated to the order they were drafted out of college. The reason we [Loken and his colleagues] are implicated is that Pinker identified an earlier blog post of ours as one of three sources he used to challenge Gladwell (yay us!). But Gladwell either misrepresented or misunderstood our post in his response, and admonishes Pinker by saying “we should agree that our differences owe less to what can be found in the scientific literature than they do to what can be found on Google.” Well, here’s what you can find on Google. Follow this link to request the data for NFL quarterbacks drafted between 1980 and
5 0.94196272 1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building
Introduction: Patrick Caldon writes: I saw your recent blog post where you discussed in passing an iterative-chain-of models approach to AI. I essentially built such a thing for my PhD thesis – not in a Bayesian context, but in a logic programming context – and proved it had a few properties and showed how you could solve some toy problems. The important bit of my framework was that at various points you also go and get more data in the process – in a statistical context this might be seen as building a little univariate model on a subset of the data, then iteratively extending into a better model with more data and more independent variables – a generalized forward stepwise regression if you like. It wrapped a proper computational framework around E.M. Gold’s identification/learning in the limit based on a logic my advisor (Eric Martin) had invented. What’s not written up in the thesis is a few months of failed struggle trying to shoehorn some simple statistical inference into this
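As a rough illustration of the "start small, then iteratively extend" idea in that excerpt, here is a minimal forward stepwise regression in the usual statistical sense (my gloss on Caldon's description, not his framework, and ignoring the logic-programming and data-acquisition parts): at each step, add whichever remaining predictor most reduces the residual sum of squares.

import numpy as np

def forward_stepwise(X, y, max_terms=5):
    # Greedily grow the model one predictor at a time, always adding the
    # column of X that gives the largest drop in residual sum of squares.
    n, p = X.shape
    chosen = []
    for _ in range(min(max_terms, p)):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in chosen:
                continue
            A = np.column_stack([np.ones(n)] + [X[:, k] for k in chosen + [j]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ beta) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
    return chosen   # indices of the selected predictors, in order of inclusion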
7 0.93510699 873 andrew gelman stats-2011-08-26-Luck or knowledge?
8 0.92829353 904 andrew gelman stats-2011-09-13-My wikipedia edit
9 0.92688894 76 andrew gelman stats-2010-06-09-Both R and Stata
same-blog 11 0.91984582 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology
12 0.91275179 1547 andrew gelman stats-2012-10-25-College football, voting, and the law of large numbers
13 0.90339261 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”
14 0.89280564 2082 andrew gelman stats-2013-10-30-Berri Gladwell Loken football update
15 0.89069116 2219 andrew gelman stats-2014-02-21-The world’s most popular languages that the Mac documentation hasn’t been translated into
16 0.88351619 2102 andrew gelman stats-2013-11-15-“Are all significant p-values created equal?”
17 0.88234568 558 andrew gelman stats-2011-02-05-Fattening of the world and good use of the alpha channel
18 0.88224137 276 andrew gelman stats-2010-09-14-Don’t look at just one poll number–unless you really know what you’re doing!
19 0.87768364 1971 andrew gelman stats-2013-08-07-I doubt they cheated
20 0.87267148 1983 andrew gelman stats-2013-08-15-More on AIC, WAIC, etc