andrew_gelman_stats andrew_gelman_stats-2010 andrew_gelman_stats-2010-192 knowledge-graph by maker-knowledge-mining

192 andrew gelman stats-2010-08-08-Turning pages into data


meta info for this blog

Source: html

Introduction: There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically? The technique to turn web pages intended for people into structured data sets intended for computers is called “screen scraping.” It has just been made easier by a wiki/community, http://scraperwiki.com/. They provide libraries to extract information from PDF and Excel files, to fill in forms automatically, and the like. Moreover, the community aspect should help researchers doing similar things get connected. It’s very good. Here’s an example of scraping road accident data, or port of London ship arrivals. You can already find collections of structured data online; examples are Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). There’s also a repository system for data, TheData (“An open-source application for publishing, citing and discovering research data”).
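The post's central idea, turning an HTML page meant for human eyes into rows a spreadsheet can hold, can be sketched in a few lines. A minimal illustration using Python's standard-library html.parser, with a hypothetical accident-count table standing in for a real page (ScraperWiki's own libraries are not shown here):

```python
from html.parser import HTMLParser

# Hypothetical page: an HTML table of road-accident counts, the kind of
# page a person can read but a computer cannot analyze directly.
PAGE = """
<table>
  <tr><td>2008</td><td>132</td></tr>
  <tr><td>2009</td><td>118</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect each <tr> as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # spreadsheet-ready rows: [['2008', '132'], ['2009', '118']]
```

Real scrapers add fetching, error handling, and site-specific selectors; the point here is only the shape of the transformation from markup to rows.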


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically? [sent-1, score-0.675]

2 The technique to turn web pages intended for people into structured data sets intended for computers is called “screen scraping.” [sent-2, score-1.321]

3 They provide libraries to extract information from PDF, Excel files, to automatically fill in forms and similar. [sent-5, score-0.525]

4 Moreover, the community aspect of it should allow researchers doing similar things to get connected. [sent-6, score-0.24]

5 Here’s an example of scraping road accident data or port of London ship arrivals. [sent-8, score-0.845]

6 You can already find collections of structured data online; examples are Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). [sent-9, score-0.996]

7 There’s also a repository system for data, TheData (“An open-source application for publishing, citing and discovering research data”). [sent-11, score-0.374]

8 The challenge is how to keep these efforts alive and active. [sent-12, score-0.279]

9 One early company helping people screen-scrape was Dapper, which is now helping retailers advertise by scraping their own websites. [sent-13, score-1.097]

10 Perhaps library funding should go towards tools like these rather than towards piling up physical copies of expensive journals that everyone reads online anyway. [sent-14, score-0.748]
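The scores above come from a tf-idf model: each sentence is weighted by how distinctive its words are within the document. The exact pipeline behind these numbers isn't shown, so this is only a rough stdlib sketch of the idea (term frequency times inverse document frequency, summed per sentence):

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by summing tf-idf weights of its words.

    Treats each sentence as a 'document'; rare words score higher.
    """
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # document frequency: in how many sentences each word appears
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append(sum(tf[w] * math.log(n / df[w]) for w in tf))
    return scores

sents = ["screen scraping turns pages into data",
         "people read pages",
         "data wants structure"]
print(tfidf_sentence_scores(sents))
```

A production summarizer would normalize by sentence length and use a smoothed idf; the scores in the table above were presumably produced by such a tuned variant.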


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('scraping', 0.312), ('structured', 0.203), ('helping', 0.193), ('intended', 0.176), ('community', 0.158), ('retailers', 0.156), ('piling', 0.156), ('port', 0.156), ('repository', 0.156), ('web', 0.15), ('data', 0.148), ('freebase', 0.147), ('turn', 0.143), ('entity', 0.141), ('infochimps', 0.141), ('advertise', 0.136), ('collections', 0.132), ('libraries', 0.126), ('loves', 0.123), ('computers', 0.117), ('accident', 0.117), ('alive', 0.113), ('road', 0.112), ('extract', 0.112), ('discovering', 0.11), ('spreadsheet', 0.11), ('fill', 0.109), ('copies', 0.109), ('files', 0.108), ('citing', 0.108), ('people', 0.107), ('excel', 0.107), ('moreover', 0.105), ('pdf', 0.105), ('london', 0.104), ('screen', 0.103), ('technique', 0.101), ('towards', 0.1), ('reads', 0.099), ('library', 0.098), ('funding', 0.096), ('built', 0.091), ('expensive', 0.09), ('forms', 0.09), ('automatically', 0.088), ('analyze', 0.086), ('efforts', 0.084), ('aspect', 0.082), ('challenge', 0.082), ('meant', 0.081)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 192 andrew gelman stats-2010-08-08-Turning pages into data


2 0.15927851 910 andrew gelman stats-2011-09-15-Google Refine

Introduction: Tools worth knowing about: Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to deal with in the browser with js… but on a subset yes). [I might write this example up later.] Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)! Refine 2.0 adds some data-collection tools for

3 0.14981672 1853 andrew gelman stats-2013-05-12-OpenData Latinoamerica

Introduction: Miguel Paz writes: Poderomedia Foundation and PinLatam are launching OpenDataLatinoamerica.org, a regional data repository to free data and use it on Hackathons and other activities by HacksHackers chapters and other organizations. We are doing this because the road to the future of news has been littered with lost datasets. A day or so after every hackathon and meeting where a group has come together to analyze, compare and understand a particular set of data, someone tries to remember where the successful files were stored. Too often, no one is certain. Therefore with Mariano Blejman we realized that we need a central repository where you can share the data that you have proved to be reliable: OpenData Latinoamerica, which we are leading as ICFJ Knight International Journalism Fellows. If you work in Latin America or Central America your organization can take part in OpenDataLatinoamerica.org. To apply, go to the website and answer a simple form agreeing to meet the standard

4 0.13444602 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

Introduction: David Karger writes: Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major dis-incentive to sharing. I would love to hear more, as this question of effort is one we specifically target in our development of tools for data authoring and publishing. As a straw man, let me point out that sharing data technically requires no more than posting an excel spreadsheet online. And that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives? Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appear

5 0.10975315 911 andrew gelman stats-2011-09-15-More data tools worth using from Google

Introduction: Speaking of open data and Google tools, see this post from Revolution R: How to use a Google Spreadsheet as data in R.

6 0.10841244 1175 andrew gelman stats-2012-02-19-Factual – a new place to find data

7 0.10740842 839 andrew gelman stats-2011-08-04-To commenters who are trying to sell something

8 0.097875148 569 andrew gelman stats-2011-02-12-Get the Data

9 0.08596278 2345 andrew gelman stats-2014-05-24-An interesting mosaic of a data programming course

10 0.076135486 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

11 0.071414679 878 andrew gelman stats-2011-08-29-Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go

12 0.071334504 1948 andrew gelman stats-2013-07-21-Bayes related

13 0.070335835 1808 andrew gelman stats-2013-04-17-Excel-bashing

14 0.069264151 1990 andrew gelman stats-2013-08-20-Job opening at an organization that promotes reproducible research!

15 0.068758115 1661 andrew gelman stats-2013-01-08-Software is as software does

16 0.068594493 1431 andrew gelman stats-2012-07-27-Overfitting

17 0.067229114 2168 andrew gelman stats-2014-01-12-Things that I like that almost nobody else is interested in

18 0.067048594 1435 andrew gelman stats-2012-07-30-Retracted articles and unethical behavior in economics journals?

19 0.066447504 223 andrew gelman stats-2010-08-21-Statoverflow

20 0.066317223 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.127), (1, -0.029), (2, -0.05), (3, -0.002), (4, 0.049), (5, -0.01), (6, -0.035), (7, -0.033), (8, -0.027), (9, 0.014), (10, -0.011), (11, -0.02), (12, -0.01), (13, -0.005), (14, -0.021), (15, 0.03), (16, 0.034), (17, -0.028), (18, 0.03), (19, -0.01), (20, 0.032), (21, 0.032), (22, -0.023), (23, -0.005), (24, -0.055), (25, -0.003), (26, 0.054), (27, -0.006), (28, 0.027), (29, 0.038), (30, 0.006), (31, -0.035), (32, 0.008), (33, 0.019), (34, 0.012), (35, 0.052), (36, -0.021), (37, 0.021), (38, -0.017), (39, 0.046), (40, 0.021), (41, 0.016), (42, 0.002), (43, 0.056), (44, -0.049), (45, 0.049), (46, 0.042), (47, -0.041), (48, 0.002), (49, -0.038)]
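The simValue column in these lists compares posts by their topic-weight vectors; cosine similarity is the standard choice for that, though the exact measure this pipeline uses is an assumption. A minimal sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# First few LSI topic weights for this post, from the table above.
a = [0.127, -0.029, -0.05, -0.002, 0.049]
print(round(cosine(a, a), 6))  # a vector compared with itself -> 1.0
```

This also explains why the same-blog rows score near 1.0 under both models: a document is maximally similar to itself, with small deviations from numerical truncation.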

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95803511 192 andrew gelman stats-2010-08-08-Turning pages into data


2 0.84796762 1175 andrew gelman stats-2012-02-19-Factual – a new place to find data

Introduction: Factual collects data on a variety of topics, organizes them, and allows easy access. If you ever wanted to do a histogram of calorie content in Starbucks coffees or plot warnings with a live feed of earthquake data – your life should be a bit simpler now. Also see DataMarket, InfoChimps, and a few older links in The Future of Data Analysis. If you access the data through the API, you can build live visualizations like this: Of course, you could just go to the source. Roy Mendelssohn writes (with minor edits): Since you are both interested in data access, please look at our service ERDDAP: http://coastwatch.pfel.noaa.gov/erddap/index.html http://upwell.pfeg.noaa.gov/erddap/index.html Please do not be fooled by the web pages. Everything is a service (including search and graphics) and the URL completely defines the request, and response formats are easily changed just by changing the “file extension”. The web pages are just html and javascript that u

3 0.84671533 1853 andrew gelman stats-2013-05-12-OpenData Latinoamerica


4 0.83960736 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?


5 0.83079773 714 andrew gelman stats-2011-05-16-NYT Labs releases Openpaths, a utility for saving your iphone data

Introduction: Jake Porway writes: We launched Openpaths the other week. It’s a site where people can privately upload and view their iPhone location data (at least until an Apple update wipes it out) and also download their data for their own use. More than just giving people a neat tool to view their data with, however, we’re also creating an option for them to donate their data to research projects at varying levels of anonymity. We’re still working out the terms for that, but we’d love any input and to get in touch with anyone who might want to use the data. I don’t have any use for this personally but maybe it will interest some of you. From the webpage: Openpaths is an anonymous, user-contributed database for the personal location data files recorded by iOS devices. Users securely store, explore, and manage their personal location data, and grant researchers access to portions of that data as they choose. All location data stored in openpaths is kept separate from user profi

6 0.80827099 1212 andrew gelman stats-2012-03-14-Controversy about a ranking of philosophy departments, or How should we think about statistical results when we can’t see the raw data?

7 0.80475664 41 andrew gelman stats-2010-05-19-Updated R code and data for ARM

8 0.79309601 2307 andrew gelman stats-2014-04-27-Big Data…Big Deal? Maybe, if Used with Caution.

9 0.77805853 910 andrew gelman stats-2011-09-15-Google Refine

10 0.77782524 911 andrew gelman stats-2011-09-15-More data tools worth using from Google

11 0.77403486 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools

12 0.77105385 1837 andrew gelman stats-2013-05-03-NYC Data Skeptics Meetup

13 0.769517 1530 andrew gelman stats-2012-10-11-Migrating your blog from Movable Type to WordPress

14 0.76836509 1434 andrew gelman stats-2012-07-29-FindTheData.org

15 0.76701432 569 andrew gelman stats-2011-02-12-Get the Data

16 0.76562768 752 andrew gelman stats-2011-06-08-Traffic Prediction

17 0.76400042 2345 andrew gelman stats-2014-05-24-An interesting mosaic of a data programming course

18 0.76164401 1990 andrew gelman stats-2013-08-20-Job opening at an organization that promotes reproducible research!

19 0.75899482 951 andrew gelman stats-2011-10-11-Data mining efforts for Obama’s campaign

20 0.74445057 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.013), (6, 0.024), (15, 0.067), (16, 0.073), (24, 0.053), (27, 0.054), (45, 0.264), (54, 0.015), (55, 0.015), (73, 0.014), (86, 0.034), (89, 0.011), (95, 0.013), (99, 0.255)]
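The LDA row above stores only the nonzero entries as (topicId, topicWeight) pairs; to compare two posts, such sparse rows are first expanded into dense vectors. A small sketch under that assumption, with a hypothetical topic count of 100:

```python
def to_dense(pairs, n_topics):
    """Expand sparse (topicId, weight) pairs into a dense weight vector."""
    v = [0.0] * n_topics
    for topic_id, weight in pairs:
        v[topic_id] = weight
    return v

# A few of this post's LDA entries, from the table above.
doc = [(1, 0.013), (6, 0.024), (45, 0.264), (99, 0.255)]
dense = to_dense(doc, 100)
print(dense[45], dense[2])  # stored weight vs. implicit zero
```

Storing only the nonzero pairs is the usual trade-off: LDA vectors are mostly zeros, so the sparse form saves space while the dense form is what similarity measures consume.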

similar blogs list:

simIndex simValue blogId blogTitle

1 0.94186389 543 andrew gelman stats-2011-01-28-NYT shills for personal DNA tests

Introduction: Kaiser nails it. The offending article, by John Tierney, somehow ended up in the Science section rather than the Opinion section. As an opinion piece (or, for that matter, a blog), Tierney’s article would be nothing special. But I agree with Kaiser that it doesn’t work as a newspaper article. As Kaiser notes, this story involves a bunch of statistical and empirical claims that are not well resolved by P.R. and rhetoric.

2 0.93097115 1407 andrew gelman stats-2012-07-06-Statistical inference and the secret ballot

Introduction: Ring Lardner, Jr.: [In 1936] I was already settled in Southern California, and it may have been that first exercise of the franchise that triggered the FBI surveillance of me that would last for decades. I had assumed, of course, that I was enjoying the vaunted American privilege of the secret ballot. On a wall outside my polling place on Wilshire Boulevard, however, was a compilation of the district’s registered voters: Democrats, a long list of names; Republicans, a somewhat lesser number; and “Declines to State,” one, “Ring W. Lardner, Jr.” The day after the election, alongside those lists were published the results: Roosevelt, so many; Landon, so many; Browder, one.

3 0.90828347 999 andrew gelman stats-2011-11-09-I was at a meeting a couple months ago . . .

Introduction: . . . and I decided to amuse myself by writing down all the management-speak words I heard: “grappling” “early prototypes” “technology platform” “building block” “machine learning” “your team” “workspace” “tagging” “data exhaust” “monitoring a particular population” “collective intelligence” “communities of practice” “hackathon” “human resources . . . technologies” Any one or two or three of these phrases might be fine, but put them all together and what you have is a festival of jargon. A hackathon, indeed.

same-blog 4 0.8782326 192 andrew gelman stats-2010-08-08-Turning pages into data


5 0.873034 1031 andrew gelman stats-2011-11-27-Richard Stallman and John McCarthy

Introduction: After blogging on quirky software pioneer Richard Stallman, I thought it appropriate to write something about recently deceased quirky software pioneer John McCarthy, who, with the exception of being bearded, seems like he was the personal and political opposite of Stallman. Here’s a page I found of McCarthy quotes (compiled by Neil Craig). It’s a mixture of the reasonable and the unreasonable (ok, I suppose the same could be said of this blog!). I wonder if he and Stallman ever met and, if so, whether they had an extended conversation. It would be like matter and anti-matter! P.S. I flipped through McCarthy’s pages and found one of my pet peeves. See item 3 here, which sounds so plausible but is in fact not true (at least, not according to the National Election Study). As McCarthy’s Stanford colleague Mo Fiorina can tell you, otherwise well-informed people believe all sorts of things about polarization that aren’t so. Labeling groups of Americans as “

6 0.86734354 1325 andrew gelman stats-2012-05-17-More on the difficulty of “preaching what you practice”

7 0.85703421 206 andrew gelman stats-2010-08-13-Indiemapper makes thematic mapping easy

8 0.85691416 69 andrew gelman stats-2010-06-04-A Wikipedia whitewash

9 0.85496449 1504 andrew gelman stats-2012-09-20-Could someone please lock this guy and Niall Ferguson in a room together?

10 0.84908849 791 andrew gelman stats-2011-07-08-Censoring on one end, “outliers” on the other, what can we do with the middle?

11 0.84505385 673 andrew gelman stats-2011-04-20-Upper-income people still don’t realize they’re upper-income

12 0.83951789 362 andrew gelman stats-2010-10-22-A redrawing of the Red-Blue map in November 2010?

13 0.83716309 310 andrew gelman stats-2010-10-02-The winner’s curse

14 0.82620591 1015 andrew gelman stats-2011-11-17-Good examples of lurking variables?

15 0.82444811 1012 andrew gelman stats-2011-11-16-Blog bribes!

16 0.81525105 735 andrew gelman stats-2011-05-28-New app for learning intro statistics

17 0.81431627 2189 andrew gelman stats-2014-01-28-History is too important to be left to the history professors

18 0.81171679 573 andrew gelman stats-2011-02-14-Hipmunk < Expedia, again

19 0.80938089 1767 andrew gelman stats-2013-03-17-The disappearing or non-disappearing middle class

20 0.8089872 728 andrew gelman stats-2011-05-24-A (not quite) grand unified theory of plagiarism, as applied to the Wegman case