andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-910 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Tools worth knowing about: Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to deal with in the browser with js… but on a subset yes). [I might write this example up later.] Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)! Refine 2.0 adds some data-collection tools for
sentIndex sentText sentNum sentScore
1 Tools worth knowing about: Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. [sent-1, score-1.047]
2 A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to deal with in the browser with js… but on a subset yes). [sent-2, score-0.36]
3 Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. [sent-5, score-0.972]
4 Refine certainly would help find duplicates, and it counts things for you too. [sent-6, score-0.176]
5 Just counting rows is too much for researchers sometimes (see yesterday’s post)! [sent-7, score-0.191]
6 Refine 2.0 adds some data-collection tools for scraping and parsing web data. [sent-9, score-0.688]
7 I have not had a chance to play with any of this kind of advanced scripting with it yet. [sent-10, score-0.288]
8 I also have not had occasion to use Freebase, which seems sort of similar to infochimps in that it is mostly open data with web APIs (for more on this, see the infochimps R package by Drew Conway). [sent-11, score-0.927]
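The data-entry problems listed in the post (leading or trailing spaces, mixed lettercase producing false duplicates, mixed decimal indicators) can be illustrated outside Refine as well. What follows is only a minimal sketch in plain Python, not Refine's own clustering method; the function names `normalize`, `parse_decimal`, and `find_false_duplicates` are hypothetical, and `parse_decimal` assumes the values contain no thousands separators.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Trim outer whitespace, collapse internal runs of whitespace, fold case."""
    return " ".join(value.split()).casefold()


def parse_decimal(text: str) -> float:
    """Parse a number written with either '.' or ',' as the decimal mark.

    Assumption: the input has no thousands separators, so every comma
    can safely be read as a decimal mark.
    """
    return float(text.strip().replace(",", "."))


def find_false_duplicates(values):
    """Group raw values by normalized form.

    Returns only the groups that contain more than one distinct spelling,
    i.e. entries that look different but normalize to the same key.
    """
    groups = {}
    for raw in values:
        groups.setdefault(normalize(raw), []).append(raw)
    return {key: raws for key, raws in groups.items() if len(set(raws)) > 1}


def count_normalized(values):
    """Count how often each normalized value occurs."""
    return Counter(normalize(v) for v in values)


if __name__ == "__main__":
    cities = ["New York", " new york", "NEW  YORK ", "Boston"]
    print(find_false_duplicates(cities))
    print(count_normalized(cities))
    print(parse_decimal("3,14"), parse_decimal(" 2.5 "))
```

Grouping by a normalized key keeps the original spellings around, so you can inspect each cluster before deciding which variant to keep.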
wordName wordTfidf (topN-words)
[('refine', 0.483), ('duplicates', 0.305), ('infochimps', 0.25), ('web', 0.199), ('mixed', 0.157), ('scraping', 0.139), ('anes', 0.139), ('rampant', 0.139), ('apis', 0.139), ('js', 0.139), ('scripting', 0.139), ('tools', 0.137), ('freebase', 0.131), ('parsing', 0.131), ('leading', 0.128), ('conway', 0.125), ('databases', 0.114), ('videos', 0.114), ('spaces', 0.109), ('transformations', 0.107), ('transforming', 0.105), ('messy', 0.104), ('browser', 0.104), ('rows', 0.104), ('cumulative', 0.101), ('extending', 0.101), ('drew', 0.101), ('cleaning', 0.099), ('units', 0.092), ('format', 0.092), ('occasion', 0.092), ('help', 0.09), ('watch', 0.089), ('services', 0.088), ('linking', 0.088), ('counting', 0.087), ('subset', 0.086), ('counts', 0.086), ('file', 0.083), ('advanced', 0.083), ('adds', 0.082), ('bigger', 0.08), ('knowing', 0.074), ('package', 0.073), ('tool', 0.071), ('column', 0.07), ('yesterday', 0.067), ('play', 0.066), ('false', 0.064), ('mostly', 0.063)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999982 910 andrew gelman stats-2011-09-15-Google Refine
2 0.15927851 192 andrew gelman stats-2010-08-08-Turning pages into data
Introduction: There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically? The technique to turn web pages intended for people into structured data sets intended for computers is called “screen scraping.” It has just been made easier with a wiki/community http://scraperwiki.com/ . They provide libraries to extract information from PDF and Excel files, to fill in forms automatically, and the like. Moreover, the community aspect of it should allow researchers doing similar things to get connected. It’s very good. Here’s an example of scraping road accident data or port of London ship arrivals. You can already find collections of structured data online; examples are Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). There’s also a repository system for data, TheData (“An open-source application for pub
3 0.10976905 911 andrew gelman stats-2011-09-15-More data tools worth using from Google
Introduction: Speaking of open data and Google tools, see this post from Revolution R: How to use a Google Spreadsheet as data in R.
4 0.097826928 424 andrew gelman stats-2010-11-21-Data cleaning tool!
Introduction: Hal Varian writes: You might find this a useful tool for cleaning data. I haven’t tried it out yet, but data cleaning is a hugely important topic and so this could be a big deal.
5 0.091954179 1175 andrew gelman stats-2012-02-19-Factual – a new place to find data
Introduction: Factual collects data on a variety of topics, organizes them, and allows easy access. If you ever wanted to do a histogram of calorie content in Starbucks coffees or plot warnings with a live feed of earthquake data – your life should be a bit simpler now. Also see DataMarket, InfoChimps, and a few older links in The Future of Data Analysis. If you access the data through the API, you can build live visualizations like this: Of course, you could just go to the source. Roy Mendelssohn writes (with minor edits): Since you are both interested in data access, please look at our service ERDDAP: http://coastwatch.pfel.noaa.gov/erddap/index.html http://upwell.pfeg.noaa.gov/erddap/index.html Please do not be fooled by the web pages. Everything is a service (including search and graphics) and the URL completely defines the request, and response formats are easily changed just by changing the “file extension”. The web pages are just html and javascript that u
6 0.080437258 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?
7 0.078025773 574 andrew gelman stats-2011-02-14-“The best data visualizations should stand on their own”? I don’t think so.
8 0.072270386 1754 andrew gelman stats-2013-03-08-Cool GSS training video! And cumulative file 1972-2012!
9 0.069923468 165 andrew gelman stats-2010-07-27-Nothing is Linear, Nothing is Additive: Bayesian Models for Interactions in Social Science
10 0.068063095 1682 andrew gelman stats-2013-01-19-R package for Bayes factors
11 0.067041807 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?
12 0.06404765 914 andrew gelman stats-2011-09-16-meta-infographic
14 0.059634946 1134 andrew gelman stats-2012-01-21-Lessons learned from a recent R package submission
15 0.05962868 569 andrew gelman stats-2011-02-12-Get the Data
16 0.058161944 1990 andrew gelman stats-2013-08-20-Job opening at an organization that promotes reproducible research!
17 0.057261504 2254 andrew gelman stats-2014-03-18-Those wacky anti-Bayesians used to be intimidating, but now they’re just pathetic
18 0.056515429 2303 andrew gelman stats-2014-04-23-Thinking of doing a list experiment? Here’s a list of reasons why you should think again
19 0.056272507 927 andrew gelman stats-2011-09-26-R and Google Visualization
20 0.055272885 215 andrew gelman stats-2010-08-18-DataMarket
topicId topicWeight
[(0, 0.096), (1, -0.016), (2, -0.023), (3, 0.004), (4, 0.039), (5, -0.006), (6, -0.003), (7, -0.037), (8, 0.013), (9, -0.008), (10, -0.014), (11, -0.01), (12, 0.012), (13, -0.005), (14, -0.007), (15, 0.029), (16, -0.012), (17, -0.021), (18, -0.006), (19, 0.008), (20, 0.018), (21, 0.009), (22, -0.012), (23, 0.007), (24, -0.044), (25, 0.006), (26, 0.013), (27, 0.026), (28, -0.015), (29, 0.007), (30, 0.026), (31, -0.014), (32, 0.024), (33, -0.018), (34, 0.011), (35, 0.008), (36, -0.006), (37, 0.014), (38, -0.016), (39, 0.024), (40, 0.016), (41, 0.016), (42, 0.013), (43, 0.029), (44, -0.011), (45, 0.047), (46, 0.01), (47, -0.022), (48, -0.003), (49, -0.057)]
simIndex simValue blogId blogTitle
same-blog 1 0.94262141 910 andrew gelman stats-2011-09-15-Google Refine
2 0.75597554 724 andrew gelman stats-2011-05-21-New search engine for data & statistics
Introduction: Jon Goldhill points us to a new search engine, Zanran, which is for finding data and statistics. Goldhill writes: It’s useful when you’re looking for a graph/table rather than a single number. For example, if you look for ‘teenage births rates in the united states’ in Zanran you’ll see a series of graphs. If you check in Google, there’s plenty of material – but you’d have to open everything up to see if it had any real numbers. (I hope you’ll appreciate Zanran’s preview capability as well – hovering over the icons gives a useful preview of the content.)
3 0.73479104 1530 andrew gelman stats-2012-10-11-Migrating your blog from Movable Type to WordPress
Introduction: Cord Blomquist, who did a great job moving us from horrible Movable Type to nice nice WordPress, writes: I [Cord] wanted to share a little news with you related to the original work we did for you last year. When ReadyMadeWeb converted your Movable Type blog to WordPress, we got a lot of other requests for the same service, so we started thinking about a bigger market for such a product. After a bit of research, we started work on automating the data conversion, writing rules, and exceptions to the rules, on how Movable Type and TypePad data could be translated to WordPress. After many months of work, we’re getting ready to announce TP2WP.com, a service that converts Movable Type and TypePad export files to WordPress import files, so anyone who wants to migrate to WordPress can do so easily and without losing permalinks, comments, images, or other files. By automating our service, we’ve been able to drop the price to just $99. I recommend it (and, no, Cord is not paying m
4 0.72733688 597 andrew gelman stats-2011-03-02-RStudio – new cross-platform IDE for R
Introduction: The new R environment RStudio looks really great, especially for users new to R. In teaching, these are often people new to programming anything, much less statistical models. The R GUIs were different on each platform, with (sometimes modal) windows appearing and disappearing and no unified design. RStudio fixes that and has already found a happy home on my desktop. Initial impressions I’ve been using it for the past couple of days. For me, it replaces the niche that R.app held: looking at help, quickly doing something I don’t want to pollute a project workspace with; sometimes data munging, merging, and transforming; and prototyping plots. RStudio is better than R.app at all of these things. For actual development and papers, though, I remain wedded to emacs+ess (good old C-x M-c M-Butterfly ). Favorite features in no particular order plots seamlessly made in new graphics devices. This is huge— instead of one active plot window named something like quartz(1) t
Introduction: David Karger writes: Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major dis-incentive to sharing. I would love to hear more, as this question of effort is one we specifically target in our development of tools for data authoring and publishing. As a straw man, let me point out that sharing data technically requires no more than posting an Excel spreadsheet online. And that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives? Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appear
6 0.71943969 192 andrew gelman stats-2010-08-08-Turning pages into data
7 0.71506608 911 andrew gelman stats-2011-09-15-More data tools worth using from Google
8 0.71483117 1127 andrew gelman stats-2012-01-18-The Fixie Bike Index
9 0.70534241 1434 andrew gelman stats-2012-07-29-FindTheData.org
10 0.70455194 1808 andrew gelman stats-2013-04-17-Excel-bashing
11 0.7020191 927 andrew gelman stats-2011-09-26-R and Google Visualization
12 0.69766355 1907 andrew gelman stats-2013-06-20-Amazing retro gnu graphics!
13 0.69580406 272 andrew gelman stats-2010-09-13-Ross Ihaka to R: Drop Dead
14 0.69046861 1807 andrew gelman stats-2013-04-17-Data problems, coding errors…what can be done?
15 0.69010466 1175 andrew gelman stats-2012-02-19-Factual – a new place to find data
16 0.68794471 1596 andrew gelman stats-2012-11-29-More consulting experiences, this time in computational linguistics
17 0.6873641 1559 andrew gelman stats-2012-11-02-The blog is back
19 0.67675859 752 andrew gelman stats-2011-06-08-Traffic Prediction
20 0.67375493 2059 andrew gelman stats-2013-10-12-Visualization, “big data”, and EDA
topicId topicWeight
[(8, 0.031), (13, 0.013), (16, 0.03), (20, 0.239), (24, 0.028), (32, 0.012), (40, 0.02), (43, 0.023), (45, 0.053), (47, 0.03), (49, 0.013), (58, 0.016), (76, 0.013), (85, 0.012), (86, 0.066), (88, 0.013), (89, 0.049), (99, 0.242)]
simIndex simValue blogId blogTitle
same-blog 1 0.88724881 910 andrew gelman stats-2011-09-15-Google Refine
2 0.87155312 479 andrew gelman stats-2010-12-20-WWJD? U can find out!
Introduction: Two positions open in the statistics group at the NYU education school. If you get the job, you get to work with Jennifer Hill! One position is a postdoctoral fellowship, and the other is a visiting professorship. The latter position requires “the demonstrated ability to develop a nationally recognized research program,” which seems like a lot to ask for a visiting professor. Do they expect the visiting prof to develop a nationally recognized research program and then leave it there at NYU after the visit is over? In any case, Jennifer and her colleagues are doing excellent work, both applied and methodological, and this seems like a great opportunity.
3 0.83802891 2108 andrew gelman stats-2013-11-20-That’s crazy talk!
Introduction: Tenure track faculty opening at the Center for the Promotion of Research Involving Innovative Statistical Methodology, with Jennifer Hill, Marc Scott, and other world-class researchers. It looks like a great opportunity.
4 0.83684731 480 andrew gelman stats-2010-12-21-Instead of “confidence interval,” let’s say “uncertainty interval”
Introduction: I’ve become increasingly uncomfortable with the term “confidence interval,” for several reasons: - The well-known difficulties in interpretation (officially the confidence statement can be interpreted only on average, but people typically implicitly give the Bayesian interpretation to each case), - The ambiguity between confidence intervals and predictive intervals. (See the footnote in BDA where we discuss the difference between “inference” and “prediction” in the classical framework.) - The awkwardness of explaining that confidence intervals are big in noisy situations where you have less confidence, and confidence intervals are small when you have more confidence. So here’s my proposal. Let’s use the term “uncertainty interval” instead. The uncertainty interval tells you how much uncertainty you have. That works pretty well, I think. P.S. As of this writing, “confidence interval” outGoogles “uncertainty interval” by the huge margin of 9.5 million to 54000. So we
5 0.80506366 1937 andrew gelman stats-2013-07-13-Meritocracy rerun
Introduction: I’ve said it here so often, this time I put it on the sister blog. . . .
7 0.781111 1647 andrew gelman stats-2013-01-01-Neoconservatism circa 1986
8 0.77166182 661 andrew gelman stats-2011-04-14-NYC 1950
9 0.76522505 900 andrew gelman stats-2011-09-11-Symptomatic innumeracy
10 0.7599684 831 andrew gelman stats-2011-07-30-A Wikipedia riddle!
11 0.75825071 1270 andrew gelman stats-2012-04-19-Demystifying Blup
12 0.75380802 1782 andrew gelman stats-2013-03-30-“Statistical Modeling: A Fresh Approach”
13 0.75301456 194 andrew gelman stats-2010-08-09-Data Visualization
14 0.75279707 974 andrew gelman stats-2011-10-26-NYC jobs in applied statistics, psychometrics, and causal inference!
15 0.74935478 1629 andrew gelman stats-2012-12-18-It happened in Connecticut
16 0.74122751 270 andrew gelman stats-2010-09-12-Comparison of forecasts for the 2010 congressional elections
17 0.74116743 254 andrew gelman stats-2010-09-04-Bayesian inference viewed as a computational approximation to classical calculations
18 0.73652577 592 andrew gelman stats-2011-02-26-“Do you need ideal conditions to do great work?”
19 0.72945106 1652 andrew gelman stats-2013-01-03-“The Case for Inductive Theory Building”