
424 andrew gelman stats-2010-11-21-Data cleaning tool!


meta information for this blog post

Source: html

Introduction: Hal Varian writes: You might find this a useful tool for cleaning data. I haven’t tried it out yet, but data cleaning is a hugely important topic and so this could be a big deal.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Hal Varian writes: You might find this a useful tool for cleaning data. [sent-1, score-1.163]

2 I haven’t tried it out yet, but data cleaning is a hugely important topic and so this could be a big deal. [sent-2, score-1.607]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('cleaning', 0.62), ('varian', 0.391), ('hal', 0.314), ('hugely', 0.31), ('tool', 0.223), ('tried', 0.186), ('deal', 0.184), ('haven', 0.172), ('yet', 0.162), ('topic', 0.144), ('useful', 0.138), ('big', 0.115), ('important', 0.11), ('find', 0.105), ('might', 0.077), ('writes', 0.065), ('could', 0.063), ('data', 0.059)]
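The page does not say how the simValue column in the lists below is computed. A common choice for comparing tf-idf vectors is cosine similarity, so the following is only a minimal sketch under that assumption. The weights are copied from the list above; the second vector (`other_post`) and the helper names `l2_norm` and `cosine_similarity` are hypothetical and exist purely for illustration.

```python
import math

# tf-idf weights for this post, copied from the top-N word list above.
this_post = {
    "cleaning": 0.62, "varian": 0.391, "hal": 0.314, "hugely": 0.31,
    "tool": 0.223, "tried": 0.186, "deal": 0.184, "haven": 0.172,
    "yet": 0.162, "topic": 0.144, "useful": 0.138, "big": 0.115,
    "important": 0.11, "find": 0.105, "might": 0.077, "writes": 0.065,
    "could": 0.063, "data": 0.059,
}

# Hypothetical weights for some other post (NOT taken from the source).
other_post = {"cleaning": 0.5, "blog": 0.7, "spam": 0.5}


def l2_norm(vec):
    """Euclidean length of a sparse word -> weight vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


norm_this = l2_norm(this_post)
similarity = cosine_similarity(this_post, other_post)
print(f"L2 norm of this post's vector: {norm_this:.4f}")
print(f"cosine similarity with the hypothetical post: {similarity:.4f}")
```

The listed weights have an L2 norm very close to 1, which suggests the vector was length-normalized. That would make the cosine similarity equal to a plain dot product. Only the shared word ("cleaning" here) contributes to the score, which is consistent with short posts matching mainly on one or two rare terms.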

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 424 andrew gelman stats-2010-11-21-Data cleaning tool!

2 0.250476 450 andrew gelman stats-2010-12-04-The Joy of Stats

Introduction: Hal Varian sends in this link to a series of educational videos described to be “a journey into the heart of statistics.” It seems to be focused on exploratory data analysis, which it describes as “an extraordinary new method of understanding ourselves and our Universe.”

3 0.23794456 817 andrew gelman stats-2011-07-23-New blog home

Introduction: Hi all. We’ve moved the blog and are still working out some bugs. For example, we delete spam comments but sometimes they remain on the blog. A few other things. We should be cleaning it up more in the next few days.

4 0.1600785 1694 andrew gelman stats-2013-01-26-Reflections on ethicsblogging

Introduction: I have to say, it distorts my internal incentives when I am happy to see really blatant examples of ethical lapses. Sort of like when you’re cleaning the attic and searching for roaches: on one hand, you’d be happy if there were none, but, still, there’s a thrill each time you find a roach and catch it—and, at that point, you want it to be a big ugly one!

5 0.1258263 2106 andrew gelman stats-2013-11-19-More on “data science” and “statistics”

Introduction: After reading Rachel and Cathy’s book, I wrote that “Statistics is the least important part of data science . . . I think it would be fair to consider statistics as a subset of data science. . . . it’s not the most important part of data science, or even close.” But then I received “Data Science for Business,” by Foster Provost and Tom Fawcett, in the mail. I might not have opened the book at all (as I’m hardly in the target audience) but for seeing a blurb by Chris Volinsky, a statistician whom I respect a lot. So I flipped through the book and it indeed looked pretty good. It moves slowly but that’s appropriate for an intro book. But what surprised me, given the book’s title and our recent discussion on the nature of data science, was that the book was 100% statistics! It had some math (for example, definitions of various distance measures), some simple algebra, some conceptual graphs such as ROC curve, some tables and graphs of low-dimensional data summaries—but almost

6 0.11763693 290 andrew gelman stats-2010-09-22-Data Thief

7 0.097826928 910 andrew gelman stats-2011-09-15-Google Refine

8 0.0975518 221 andrew gelman stats-2010-08-21-Busted!

9 0.086778343 897 andrew gelman stats-2011-09-09-The difference between significant and not significant…

10 0.084697403 1842 andrew gelman stats-2013-05-05-Cleaning up science

11 0.083812527 1072 andrew gelman stats-2011-12-19-“The difference between . . .”: It’s not just p=.05 vs. p=.06

12 0.068449691 1450 andrew gelman stats-2012-08-08-My upcoming talk for the data visualization meetup

13 0.065780506 1252 andrew gelman stats-2012-04-08-Jagdish Bhagwati’s definition of feminist sincerity

14 0.064840719 624 andrew gelman stats-2011-03-22-A question about the economic benefits of universities

15 0.064196177 1668 andrew gelman stats-2013-01-11-My talk at the NY data visualization meetup this Monday!

16 0.062665544 2286 andrew gelman stats-2014-04-08-Understanding Simpson’s paradox using a graph

17 0.062569633 773 andrew gelman stats-2011-06-18-Should we always be using the t and robit instead of the normal and logit?

18 0.062485561 1054 andrew gelman stats-2011-12-12-More frustrations trying to replicate an analysis published in a reputable journal

19 0.061138473 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools

20 0.060403105 2151 andrew gelman stats-2013-12-27-Should statistics have a Nobel prize?


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.064), (1, -0.0), (2, -0.018), (3, 0.001), (4, 0.024), (5, -0.005), (6, -0.012), (7, -0.004), (8, 0.018), (9, 0.014), (10, 0.004), (11, -0.005), (12, 0.024), (13, -0.012), (14, 0.014), (15, 0.013), (16, -0.015), (17, -0.023), (18, 0.006), (19, -0.003), (20, -0.016), (21, -0.014), (22, -0.023), (23, 0.0), (24, -0.037), (25, 0.012), (26, 0.027), (27, -0.01), (28, 0.028), (29, -0.027), (30, 0.008), (31, 0.016), (32, 0.031), (33, -0.033), (34, -0.019), (35, 0.04), (36, -0.004), (37, 0.019), (38, -0.025), (39, 0.04), (40, -0.006), (41, 0.04), (42, -0.015), (43, 0.016), (44, -0.008), (45, 0.003), (46, 0.006), (47, -0.004), (48, 0.028), (49, -0.025)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.93889987 424 andrew gelman stats-2010-11-21-Data cleaning tool!

2 0.64396882 1808 andrew gelman stats-2013-04-17-Excel-bashing

Introduction: In response to the latest controversy, a statistics professor writes: It’s somewhat surprising to see Very Serious Researchers (apologies to Paul Krugman) using Excel. Some years ago, I was consulting on a trademark infringement case and was trying (unsuccessfully) to replicate another expert’s regression analysis. It wasn’t until I had the brainstorm to use Excel that I was able to reproduce his results – it may be better now, but at the time, Excel could propagate round-off error and catastrophically cancel like no other software! Microsoft has lots of top researchers so it’s hard for me to understand how Excel can remain so crappy. I mean, sure, I understand in some general way that they have a large user base, it’s hard to maintain backward compatibility, there’s feature creep, and, besides all that, lots of people have different preferences in data analysis than I do. But still, it’s such a joke. Word has problems too, but I can see how these problems arise from its d

3 0.63722616 524 andrew gelman stats-2011-01-19-Data exploration and multiple comparisons

Introduction: Bill Harris writes: I’ve read your paper and presentation showing why you don’t usually worry about multiple comparisons. I see how that applies when you are comparing results across multiple settings (states, etc.). Does the same principle hold when you are exploring data to find interesting relationships? For example, you have some data, and you’re trying a series of models to see which gives you the most useful insight. Do you try your models on a subset of the data so you have another subset for confirmatory analysis later, or do you simply throw all the data against your models? My reply: I’d like to estimate all the relationships at once and use a multilevel model to do partial pooling to handle the multiplicity issues. That said, in practice, in my applied work I’m always bouncing back and forth between different hypotheses and different datasets, and often I learn a lot when next year’s data come in and I can modify my hypotheses. The trouble with the classical

4 0.62750578 1973 andrew gelman stats-2013-08-08-For chrissake, just make up an analysis already! We have a lab here to run, y’know?

Introduction: Ben Hyde sends along this: Stuck in the middle of the supplemental data, reporting the total workup for their compounds, was this gem: Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis . . . I’m reminded of our recent discussions of coauthorship, where I argued that I see real advantages to having multiple people taking responsibility for the result. Jay Verkuilen responded: “On the flipside of collaboration . . . is diffusion of responsibility, where everybody thinks someone else ‘has that problem’ and thus things don’t get solved.” That’s what seems to have happened (hilariously) here.

5 0.61805034 2307 andrew gelman stats-2014-04-27-Big Data…Big Deal? Maybe, if Used with Caution.

Introduction: This post is by David K. Park. As we have witnessed, the term “big data” has been thrust onto the zeitgeist in the past several years; however, when one pushes beyond the hype, there seems to be little substance there. We’ve always had “data,” so what’s so unique about it this time? Yes, we recognize it’s “big,” but is there anything unique about data this time around? I’ve spent some time thinking about this and the answer seems to be yes, and it falls on three dimensions: Capturing Conversations & Relationships: Individuals have always communicated with one another, but now we can capture some of that conversation – email, blogs, social media (Facebook, Twitter, Pinterest) – and we can now do it with machines via sensors, i.e., “the internet of things” as we hear so much about; Granularity: We can now understand individuals at a much finer level of analysis. No longer do we need to rely on a sample size of 500 people to “represent” the nation, but instead we can acc

6 0.61530703 211 andrew gelman stats-2010-08-17-Deducer update

7 0.61513966 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools

8 0.61337322 910 andrew gelman stats-2011-09-15-Google Refine

9 0.61312616 1212 andrew gelman stats-2012-03-14-Controversy about a ranking of philosophy departments, or How should we think about statistical results when we can’t see the raw data?

10 0.60200036 450 andrew gelman stats-2010-12-04-The Joy of Stats

11 0.60000294 1690 andrew gelman stats-2013-01-23-When are complicated models helpful in psychology research and when are they overkill?

12 0.59669548 290 andrew gelman stats-2010-09-22-Data Thief

13 0.59623379 192 andrew gelman stats-2010-08-08-Turning pages into data

14 0.59309608 1837 andrew gelman stats-2013-05-03-NYC Data Skeptics Meetup

15 0.58936596 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

16 0.589284 2345 andrew gelman stats-2014-05-24-An interesting mosaic of a data programming course

17 0.57038909 1727 andrew gelman stats-2013-02-19-Beef with data

18 0.57020515 1971 andrew gelman stats-2013-08-07-I doubt they cheated

19 0.56789511 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

20 0.5677675 1142 andrew gelman stats-2012-01-29-Difficulties with the 1-4-power transformation


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(13, 0.256), (21, 0.078), (24, 0.02), (30, 0.088), (96, 0.073), (99, 0.274)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.91903627 424 andrew gelman stats-2010-11-21-Data cleaning tool!

2 0.8968733 1559 andrew gelman stats-2012-11-02-The blog is back

Introduction: We had some security problem: not an actual virus or anything, but a potential leak which caused Google to blacklist us. Cord fixed us and now we’re fine. Good job, Google! Better to find the potential problem before there is any harm!

3 0.88916004 345 andrew gelman stats-2010-10-15-Things we do on sabbatical instead of actually working

Introduction: Frank Fischer, a political scientist at Rutgers U., says his alleged plagiarism was mere sloppiness and not all that uncommon in scholarship. I’ve heard about plagiarism but I had no idea it occurred in political science.

4 0.85469323 817 andrew gelman stats-2011-07-23-New blog home

Introduction: Hi all. We’ve moved the blog and are still working out some bugs. For example, we delete spam comments but sometimes they remain on the blog. A few other things. We should be cleaning it up more in the next few days.

5 0.85327506 1519 andrew gelman stats-2012-10-02-Job!

Introduction: Faten Sabry writes: We are looking to hire full time analysts at the undergraduate and graduate levels. The work involves extensive econometric analysis and handling of large databases. The analysts will be part of a team working to address various empirical microeconomic issues. I worked with Faten and her colleagues on a consulting project once, and they seemed like reasonable people to me.

6 0.84798527 1514 andrew gelman stats-2012-09-28-AdviseStat 47% Campaign Ad

7 0.84775937 1852 andrew gelman stats-2013-05-12-Crime novels for economists

8 0.83548057 234 andrew gelman stats-2010-08-25-Modeling constrained parameters

9 0.83099568 1789 andrew gelman stats-2013-04-05-Elites have alcohol problems too!

10 0.82950759 172 andrew gelman stats-2010-07-30-Why don’t we have peer reviewing for oral presentations?

11 0.82700145 2011 andrew gelman stats-2013-09-07-Here’s what happened when I finished my PhD thesis

12 0.82091314 971 andrew gelman stats-2011-10-25-Apply now for Earth Institute postdoctoral fellowships at Columbia University

13 0.80279875 597 andrew gelman stats-2011-03-02-RStudio – new cross-platform IDE for R

14 0.79928011 1916 andrew gelman stats-2013-06-27-The weirdest thing about the AJPH story

15 0.78315806 1137 andrew gelman stats-2012-01-24-Difficulties in publishing non-replications of implausible findings

16 0.78291547 437 andrew gelman stats-2010-11-29-The mystery of the U-shaped relationship between happiness and age

17 0.77848387 800 andrew gelman stats-2011-07-13-I like lineplots

18 0.76823497 1942 andrew gelman stats-2013-07-17-“Stop and frisk” statistics

19 0.7633127 2069 andrew gelman stats-2013-10-19-R package for effect size calculations for psychology researchers

20 0.76218903 1509 andrew gelman stats-2012-09-24-Analyzing photon counts