andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-176 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Washington Post and Slate reporter Anne Applebaum wrote a dismissive column about Wikileaks, saying that they “offer nothing more than raw data.” Applebaum argues that “The notion that the Internet can replace traditional news-gathering has just been revealed to be a myth. . . . without more journalism, more investigation, more work, these documents just don’t matter that much.” Fine. But don’t undervalue the role of mere data! The usual story is that we don’t get to see the raw data underlying newspaper stories. Wikileaks and other crowdsourced data can be extremely useful, whether or not they replace “traditional news-gathering.”
sentIndex sentText sentNum sentScore
1 Washington Post and Slate reporter Anne Applebaum wrote a dismissive column about Wikileaks, saying that they “offer nothing more than raw data.” [sent-1, score-0.84]
2 Applebaum argues that “The notion that the Internet can replace traditional news-gathering has just been revealed to be a myth.” [sent-2, score-0.879]
3 without more journalism, more investigation, more work, these documents just don’t matter that much. [sent-6, score-0.288]
4 The usual story is that we don’t get to see the raw data underlying newspaper stories. [sent-9, score-0.735]
5 Wikileaks and other crowdsourced data can be extremely useful, whether or not they replace “traditional news-gathering.” [sent-10, score-0.505]
wordName wordTfidf (topN-words)
[('applebaum', 0.47), ('wikileaks', 0.47), ('replace', 0.253), ('raw', 0.236), ('traditional', 0.217), ('anne', 0.186), ('dismissive', 0.181), ('investigation', 0.16), ('documents', 0.151), ('notion', 0.148), ('mere', 0.143), ('slate', 0.14), ('revealed', 0.138), ('journalism', 0.13), ('reporter', 0.129), ('argues', 0.123), ('washington', 0.12), ('internet', 0.116), ('newspaper', 0.113), ('column', 0.107), ('extremely', 0.104), ('role', 0.099), ('offer', 0.098), ('underlying', 0.096), ('data', 0.087), ('usual', 0.086), ('matter', 0.081), ('nothing', 0.068), ('useful', 0.068), ('saying', 0.065), ('whether', 0.061), ('story', 0.059), ('without', 0.056), ('post', 0.055), ('wrote', 0.054), ('work', 0.037), ('get', 0.03), ('see', 0.028)]
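The page does not say how the word weights above were produced. As a rough illustration only, the sketch below shows one common way to get a top-N list of tf-idf weights for a single document. The function name `top_tfidf_words`, the whitespace tokenizer, the smoothed idf formula, and the L2 normalization are all assumptions made for this example, not the pipeline that generated the list.

```python
import math
from collections import Counter


def top_tfidf_words(docs, doc_index, top_n=5):
    """Return the top_n (word, weight) pairs for docs[doc_index].

    Hypothetical helper: weights are tf * smoothed idf, L2-normalized
    over the chosen document. The real pipeline may differ.
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    # Term frequency within the chosen document.
    tf = Counter(tokenized[doc_index])

    # Smoothed idf (assumption): log((1 + N) / (1 + df)) + 1.
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1.0)
        for word, count in tf.items()
    }

    # L2-normalize so the weights are comparable across documents.
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, round(value / norm, 3)) for word, value in ranked[:top_n]]


if __name__ == "__main__":
    toy_docs = ["wikileaks raw data", "raw data tables", "data graphs"]
    print(top_tfidf_words(toy_docs, 0))
```

In this toy corpus, "wikileaks" appears in only one document and so ranks first, which mirrors how rare words such as 'applebaum' and 'wikileaks' lead the list above.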
simIndex simValue blogId blogTitle
same-blog 1 1.0 176 andrew gelman stats-2010-08-02-Information is good
2 0.080569483 1745 andrew gelman stats-2013-03-02-Classification error
Introduction: 15-2040 != 19-3010 (and, for that matter, 25-1022 != 25-1063).
3 0.062093444 2236 andrew gelman stats-2014-03-07-Selection bias in the reporting of shaky research
Introduction: I’ll reorder this week’s posts a bit in order to continue on a topic that came up yesterday. A couple days ago a reporter wrote to me asking what I thought of this paper on Money, Status, and the Ovulatory Cycle. I responded: Given the quality of the earlier paper by these researchers, I’m not inclined to believe anything these people write. But, to be specific, I can point out some things: - The authors define low fertility as days 8-14. Oddly enough, these authors in their earlier paper used days 7-14. But according to womenshealth.gov, the most fertile days are between days 10 and 17. The choice of these days affects their analysis, and it is not a good sign that they use different days in different papers. (see more on this point in sections 2.3 and 3.1 of this paper: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf) - They perform a lot of different analyses, and many others could be performed. For example, “Study 1 indicates that ovul
4 0.061088137 868 andrew gelman stats-2011-08-24-Blogs vs. real journalism
Introduction: I was thinking a bit more about Jonathan Rauch’s lament about the fading of the buggy-whip industry print journalism, in which he mocks bloggers, analogizes blogging to scribbling with spray paint on the side of a building, and writes that the blogosphere is “the single worst medium for sustained, and therefore grown-up, reading and writing and argumentation ever invented.” Yup. Worse than talk radio. Worse than cave painting. Worse than smoke signals, rock ‘n’ roll lyrics, woodcuts, spray-paint graffiti, and every other medium of communication ever invented. OK, he didn’t really mean it. Rauch actually has an ironclad argument here. He’s claiming, in a blog, that blogging is crap. Therefore, if he fills his blog with unsupported exaggerations, that’s fine, as he’s demonstrating that blogging is . . . crap. Not to pile on, but, hey, why not? I was curious what Rauch has blogged on lately, so I googled Jonathan Rauch blog and ended up at this site, which most recently
5 0.055817202 372 andrew gelman stats-2010-10-27-A use for tables (really)
Introduction: After our recent discussion of semigraphic displays, Jay Ulfelder sent along a semigraphic table from his recent book. He notes, “When countries are the units of analysis, it’s nice that you can use three-letter codes, so all the proper names have the same visual weight.” Ultimately I think that graphs win over tables for display. However in our work we spend a lot of time looking at raw data, often simply to understand what data we have. This use of tables has, I think, been forgotten in the statistical graphics literature. So I’d like to refocus the eternal tables vs. graphs discussion. If the goal is to present information, comparisons, relationships, models, data, etc etc, graphs win. Forget about tables. But . . . when you’re looking at your data, it can often help to see the raw numbers. Once you’re looking at numbers, it makes sense to organize them. Even a displayed matrix in R is a form of table, after all. And once you’re making a table, it can be sensible to
6 0.053647816 2012 andrew gelman stats-2013-09-07-Job openings at American University
7 0.052605178 1600 andrew gelman stats-2012-12-01-$241,364.83 – $13,000 = $228,364.83
8 0.051277161 1930 andrew gelman stats-2013-07-09-Symposium Magazine
9 0.050978176 1181 andrew gelman stats-2012-02-23-Philosophy: Pointer to Salmon
10 0.050066326 1802 andrew gelman stats-2013-04-14-Detecting predictability in complex ecosystems
11 0.047860336 514 andrew gelman stats-2011-01-13-News coverage of statistical issues…how did I do?
12 0.047857165 678 andrew gelman stats-2011-04-25-Democrats do better among the most and least educated groups
13 0.047266115 716 andrew gelman stats-2011-05-17-Is the internet causing half the rapes in Norway? I wanna see the scatterplot.
14 0.046201862 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis
15 0.046071634 1002 andrew gelman stats-2011-11-10-“Venetia Orcutt, GWU med school professor, quits after complaints of no-show class”
16 0.04605075 2107 andrew gelman stats-2013-11-20-NYT (non)-retraction watch
18 0.042730052 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?
19 0.042268191 408 andrew gelman stats-2010-11-11-Incumbency advantage in 2010
20 0.041696392 1363 andrew gelman stats-2012-06-03-Question about predictive checks
topicId topicWeight
[(0, 0.062), (1, -0.015), (2, -0.013), (3, 0.003), (4, -0.002), (5, -0.009), (6, -0.006), (7, 0.001), (8, 0.006), (9, 0.004), (10, -0.015), (11, 0.017), (12, -0.013), (13, 0.006), (14, -0.014), (15, 0.014), (16, -0.003), (17, -0.002), (18, 0.021), (19, 0.013), (20, -0.0), (21, -0.004), (22, -0.009), (23, -0.01), (24, -0.026), (25, 0.027), (26, 0.004), (27, 0.006), (28, 0.031), (29, 0.0), (30, 0.006), (31, 0.008), (32, -0.004), (33, 0.006), (34, 0.019), (35, 0.036), (36, -0.007), (37, -0.023), (38, 0.005), (39, 0.013), (40, 0.017), (41, -0.019), (42, 0.004), (43, -0.005), (44, 0.008), (45, 0.006), (46, -0.0), (47, -0.009), (48, -0.002), (49, -0.041)]
simIndex simValue blogId blogTitle
same-blog 1 0.90755391 176 andrew gelman stats-2010-08-02-Information is good
2 0.71174377 1525 andrew gelman stats-2012-10-08-Ethical standards in different data communities
Introduction: I opened the paper today and saw this from Paul Krugman, on Jack Welch, the former chairman of General Electric, who posted an assertion on Twitter that the [recent unemployment data] had been cooked to help President Obama’s re-election campaign. His claim was quickly picked up by right-wing pundits and media personalities. It was nonsense, of course. Job numbers are prepared by professional civil servants, at an agency that currently has no political appointees. But then maybe Mr. Welch — under whose leadership G.E. reported remarkably smooth earnings growth, with none of the short-term fluctuations you might have expected (fluctuations that reappeared under his successor) — doesn’t know how hard it would be to cook the jobs data. I was curious so I googled *general electric historical earnings*. It was surprisingly difficult to find the numbers! Most of the links just went back to 2011, or to 2008. Eventually I came across this blog by Barry Ritholtz that showed this
Introduction: David Karger writes: Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major dis-incentive to sharing. I would love to hear more, as this question of effort is one we specifically target in our development of tools for data authoring and publishing. As a straw man, let me point out that sharing data technically requires no more than posting an excel spreadsheet online. And that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives? Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appear
4 0.70099956 2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?
Introduction: Rachel Schutt and Cathy O’Neil just came out with a wonderfully readable book on doing data science, based on a course Rachel taught last year at Columbia. Rachel is a former Ph.D. student of mine and so I’m inclined to have a positive view of her work; on the other hand, I did actually look at the book and I did find it readable! What do I claim is the least important part of data science? Here’s what Schutt and O’Neil say regarding the title: “Data science is not just a rebranding of statistics or machine learning but rather a field unto itself.” I agree. There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. The question then arises: why do descriptions of data science focus so
5 0.68918425 1449 andrew gelman stats-2012-08-08-Gregor Mendel’s suspicious data
Introduction: Howard Wainer points me to a thoughtful discussion by Moti Nissani on “Psychological, Historical, and Ethical Reflections on the Mendelian Paradox.” The paradox, as Nissani defines it, is that Mendel’s data seem in many cases too good to be true, yet Mendel had a reputation for probity and it seems doubtful that he had a Mark-Hauser-style attitude toward reporting scientific data. Nissani writes: Taken together, the situation seems paradoxical. On the one hand, we have evidence that “the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations.” We also have good reasons to believe that Mendel encountered linkage but failed to report it and that he may have taken the somewhat unusual step of having his scientific records destroyed shortly after his death. On the other hand, everything else we know about him/in addition to his undisputed genius/suggests a man of unimpeachable integrity, fine observational powers, and a pa
7 0.67475802 215 andrew gelman stats-2010-08-18-DataMarket
8 0.66605592 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools
9 0.65559518 192 andrew gelman stats-2010-08-08-Turning pages into data
10 0.6547817 2307 andrew gelman stats-2014-04-27-Big Data…Big Deal? Maybe, if Used with Caution.
11 0.65215325 1837 andrew gelman stats-2013-05-03-NYC Data Skeptics Meetup
12 0.65186751 724 andrew gelman stats-2011-05-21-New search engine for data & statistics
13 0.6492039 253 andrew gelman stats-2010-09-03-Gladwell vs Pinker
14 0.64460915 1974 andrew gelman stats-2013-08-08-Statistical significance and the dangerous lure of certainty
15 0.64326221 1805 andrew gelman stats-2013-04-16-Memo to Reinhart and Rogoff: I think it’s best to admit your errors and go on from there
17 0.63604724 1286 andrew gelman stats-2012-04-28-Agreement Groups in US Senate and Dynamic Clustering
18 0.635674 1906 andrew gelman stats-2013-06-19-“Behind a cancer-treatment firm’s rosy survival claims”
19 0.63269842 757 andrew gelman stats-2011-06-10-Controversy over the Christakis-Fowler findings on the contagion of obesity
20 0.63139731 471 andrew gelman stats-2010-12-17-Attractive models (and data) wanted for statistical art show.
topicId topicWeight
[(13, 0.026), (16, 0.079), (21, 0.088), (24, 0.072), (36, 0.333), (63, 0.021), (86, 0.079), (99, 0.149)]
simIndex simValue blogId blogTitle
1 0.93206531 2242 andrew gelman stats-2014-03-10-Stan Model of the Week: PK Calculation of IV and Oral Dosing
Introduction: [Update: Revised given comments from Wingfeet, Andrew and germo. Thanks! I'd mistakenly translated the dlnorm priors in the first version --- amazing what a difference the priors make. I also escaped the less-than and greater-than signs in the constraints in the model so they're visible. I also updated to match the thin=2 output of JAGS.] We’re going to be starting a Stan “model of the P” (for some time period P) column, so I thought I’d kick things off with one of my own. I’ve been following the Wingvoet blog, the author of which is identified only by the Blogger handle Wingfeet; a couple of days ago this lovely post came out: PK calculation of IV and oral dosing in JAGS Wingfeet’s post implemented an answer to question 6 from chapter 6 of problem from Rowland and Tozer’s 2010 book, Clinical Pharmacokinetics and Pharmacodynamics, Fourth edition, Lippincott, Williams & Wilkins. So in the grand tradition of using this blog to procrastinate, I thought I’d t
same-blog 2 0.89410686 176 andrew gelman stats-2010-08-02-Information is good
3 0.80914199 1797 andrew gelman stats-2013-04-10-“Proposition and experiment”
Introduction: Anna Lena Phillips writes: I. Many people will not, of their own accord, look at a poem. II. Millions of people will, of their own accord, spend lots and lots of time looking at photographs of cats. III. Therefore, earlier this year, I concluded that the best strategy for increasing the number of viewers for poems would be to print them on top of photographs of cats. IV. I happen to like looking at both poems and cats. V. So this is, for me, a win-win situation. VI. Fortunately, my own cat is a patient model, and (if I am to be believed) quite photogenic. VII. The aforementioned cat is Tisko Tansi, small hero. VII. Thus I present to you (albeit in digital rather than physical form) an Endearments broadside, featuring a poem that originally appeared in BlazeVOX spring 2011. VIII. If you want to share a copy of this image, please ask first. If you want a real copy, you can ask about that too. She follows up with an image of a cat, on which is superimposed a short
4 0.78126079 1476 andrew gelman stats-2012-08-30-Stan is fast
Introduction: 10,000 iterations for 4 chains on the (precompiled) efficiently-parameterized 8-schools model: > date () [1] "Thu Aug 30 22:12:53 2012" > fit3 <- stan (fit=fit2, data = schools_dat, iter = 1e4, n_chains = 4) SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1). Iteration: 10000 / 10000 [100%] (Sampling) SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2). Iteration: 10000 / 10000 [100%] (Sampling) SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3). Iteration: 10000 / 10000 [100%] (Sampling) SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4). Iteration: 10000 / 10000 [100%] (Sampling) > date () [1] "Thu Aug 30 22:12:55 2012" > print (fit3) Inference for Stan model: anon_model. 4 chains: each with iter=10000; warmup=5000; thin=1; 10000 iterations saved. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat mu 8.0 0.1 5.1 -2.0 4.7 8.0 11.3 18.4 4032 1 tau 6.7 0.1 5.6 0.3 2.5 5.4 9.3 21.2 2958 1 eta[1] 0.4 0.0 0.9 -1.5 -0
5 0.75987345 1478 andrew gelman stats-2012-08-31-Watercolor regression
Introduction: Solomon Hsiang writes: Two small follow-ups based on the discussion (the second/bigger one is to address your comment about the 95% CI edges). 1. I realized that if we plot the confidence intervals as a solid color that fades (eg. using the “fixed ink” scheme from before) we can make sure the regression line also has heightened visual weight where confidence is high by plotting the line white. This makes the contrast (and thus visual weight) between the regression line and the CI highest when the CI is narrow and dark. As the CI fade near the edges, so does the contrast with the regression line. This is a small adjustment, but I like it because it is so simple and it makes the graph much nicer. (see “visually_weighted_fill_reverse” attached). My posted code has been updated to do this automatically. 2. You and your readers didn’t like that the edges of the filled CI were so sharp and arbitrary. But I didn’t like that the contrast between the spaghetti lines and the background
6 0.73707092 551 andrew gelman stats-2011-02-02-Obama and Reagan, sitting in a tree, etc.
7 0.68581367 370 andrew gelman stats-2010-10-25-Who gets wedding announcements in the Times?
8 0.66970772 55 andrew gelman stats-2010-05-27-In Linux, use jags() to call Jags instead of using bugs() to call OpenBugs
9 0.66679358 1470 andrew gelman stats-2012-08-26-Graphs showing regression uncertainty: the code!
10 0.65702438 883 andrew gelman stats-2011-09-01-Arrow’s theorem update
11 0.65135443 101 andrew gelman stats-2010-06-20-“People with an itch to scratch”
12 0.64258265 1847 andrew gelman stats-2013-05-08-Of parsing and chess
13 0.63107359 2105 andrew gelman stats-2013-11-18-What’s my Kasparov number?
14 0.62990069 1217 andrew gelman stats-2012-03-17-NSF program “to support analytic and methodological research in support of its surveys”
16 0.60645401 1898 andrew gelman stats-2013-06-14-Progress! (on the understanding of the role of randomization in Bayesian inference)
17 0.57803941 1666 andrew gelman stats-2013-01-10-They’d rather be rigorous than right
18 0.57301694 2318 andrew gelman stats-2014-05-04-Stan (& JAGS) Tutorial on Linear Mixed Models
19 0.5571121 998 andrew gelman stats-2011-11-08-Bayes-Godel
20 0.552688 818 andrew gelman stats-2011-07-23-Parallel JAGS RNGs