2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?


meta information for this blog post

Source: html

Introduction: Rachel Schutt and Cathy O’Neil just came out with a wonderfully readable book on doing data science, based on a course Rachel taught last year at Columbia. Rachel is a former Ph.D. student of mine and so I’m inclined to have a positive view of her work; on the other hand, I did actually look at the book and I did find it readable! What do I claim is the least important part of data science? Here’s what Schutt and O’Neil say regarding the title: “Data science is not just a rebranding of statistics or machine learning but rather a field unto itself.” I agree. There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. The question then arises: why do descriptions of data science focus so


Summary: the most important sentences, as generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Rachel Schutt and Cathy O’Neil just came out with a wonderfully readable book on doing data science, based on a course Rachel taught last year at Columbia. [sent-1, score-0.395]

2 I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. [sent-9, score-0.598]

3 The question then arises: why do descriptions of data science focus so strongly on statistical tasks? [sent-10, score-0.382]

4 (As Schutt and O’Neil write, “the media often describes data science in a way that makes it sound like as if it’s simply statistics or machine learning in the context of the tech industry. [sent-11, score-0.59]

5 The statistical part of data science is more of an option. [sent-14, score-0.486]

6 But in recent years, lots of tech companies have made use of statistical methods (including various statistical ideas that have been developed in the computer science literature). [sent-16, score-0.402]

7 So, from the industry perspective, the new part of data science is the statistics. [sent-17, score-0.491]

8 Statistics is the least important part of data science, hence it is the part most recently added, hence it is the part that is getting the most attention right now. [sent-18, score-0.488]

9 ” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft. [sent-20, score-0.356]

10 But privacy takes lives too, as we see from this story of emergency room deaths. [sent-47, score-0.346]

11 Here’s Wikipedia: “A town is a human settlement larger than a village but smaller than a city. [sent-53, score-0.522]

12 In the United States of America, the term “town” refers to an area of population distinct from others in some meaningful dimension, typically population or type of government. [sent-58, score-0.413]

13 In some instances, the term “town” refers to a small incorporated municipality of less than 10,000 people, while in others a town can be significantly larger. [sent-62, score-0.853]

14 So that can’t be right—there’s no way that half the deaths in this town are caused by poor record-keeping in a hospital. [sent-71, score-0.766]

15 If the town had 20,000 people (which would seem to be near the upper limit of the population of a town that one would call “smallish,” at least in the United States), then we’re talking 1/4 of the deaths, which still seems way too large a proportion. [sent-72, score-1.331]

16 Even if it is a town with lots of old people, so that much more than 1/70 of the population is dropping off each year, the numbers just don’t seem to add up. [sent-73, score-0.698]

17 Maybe the town happens to have a large regional hospital. [sent-74, score-0.579]

18 But, 75 excess deaths a year caused by “lack of information flow” still seems like a lot, and if the patients are drawn from a large population, it seems a bit misleading to describe these deaths as being “in a certain smallish town. [sent-75, score-0.855]

19 From a statistical perspective, we want to know the data-generation process (also called the likelihood function, also called “where did the data come from”) as well as the numerical data (or, in this case, the story, or anecdote, or parable, itself). [sent-82, score-0.407]

20 Summary I enjoyed Rachel and Cathy’s book, it’s readable, informative, and like no other book I’ve read on the topic of statistics or data science. [sent-83, score-0.371]
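
The arithmetic behind sentences 14 to 18 above can be checked with a short sketch. The figure of 75 excess deaths a year and the 1/70 annual death rate come from the post itself; the population of 10,000 is an assumption taken from the quoted definition of a small town, and 20,000 is the post's own upper limit. The function name is illustrative only.

```python
# Back-of-the-envelope check of the "smallish town" numbers.
# From the post: about 75 excess deaths a year, and roughly 1/70 of a
# population dying each year. The 10,000 population is an assumption
# based on the quoted "less than 10,000 people" definition of a town.

EXCESS_DEATHS_PER_YEAR = 75
ANNUAL_DEATH_RATE = 1 / 70


def share_of_town_deaths(population: int) -> float:
    """Fraction of the town's expected yearly deaths that the
    claimed excess deaths would represent."""
    expected_deaths = population * ANNUAL_DEATH_RATE
    return EXCESS_DEATHS_PER_YEAR / expected_deaths


for population in (10_000, 20_000):
    share = share_of_town_deaths(population)
    print(f"population {population}: {share:.3f} of all deaths")
```

Under these assumptions the claimed deaths come to roughly half of all deaths in a town of 10,000 and roughly a quarter in a town of 20,000, which is the implausibility the post points out.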


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('town', 0.522), ('smallish', 0.209), ('schutt', 0.209), ('rachel', 0.197), ('hadoop', 0.192), ('deaths', 0.19), ('data', 0.176), ('science', 0.151), ('tech', 0.141), ('neil', 0.139), ('population', 0.126), ('part', 0.104), ('term', 0.103), ('readable', 0.096), ('municipality', 0.096), ('patients', 0.095), ('privacy', 0.084), ('cathy', 0.084), ('chapter', 0.081), ('emergency', 0.081), ('story', 0.079), ('wikipedia', 0.078), ('incorporated', 0.074), ('statistics', 0.07), ('hospital', 0.07), ('hand', 0.068), ('flow', 0.067), ('book', 0.063), ('enjoyed', 0.062), ('coding', 0.062), ('industry', 0.06), ('records', 0.06), ('year', 0.06), ('refers', 0.058), ('large', 0.057), ('states', 0.057), ('statistical', 0.055), ('definition', 0.054), ('match', 0.054), ('caused', 0.054), ('call', 0.054), ('map', 0.053), ('room', 0.052), ('machine', 0.052), ('city', 0.051), ('big', 0.051), ('represents', 0.051), ('seem', 0.05), ('reduce', 0.05), ('lives', 0.05)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?

2 0.69306129 2184 andrew gelman stats-2014-01-24-Parables vs. stories

Introduction: God is in every leaf of every tree, but he is not in every leaf of every parable. Let me explain with a story. A few months ago I read the new book, Doing Data Science, by Rachel Schutt and Cathy O’Neil, and I came across the following motivation for comprehensive integration of data sources, a story that is reminiscent of the parables we sometimes see in business books: By some estimates, one or two patients died per week in a certain smallish town because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic. In other words, if the records had been easier to match, they’d have been able to save more lives. On the other hand, if it had been easy to match records, other breaches of confidence might also have occurred. Of course it’s hard to know exactly how many lives are at stake, but it’s nontrivial. The moral: We can assume we think privacy is a generally good thing. . . . But privacy takes lives too, as we see from

3 0.22950366 2106 andrew gelman stats-2013-11-19-More on “data science” and “statistics”

Introduction: After reading Rachel and Cathy’s book, I wrote that “Statistics is the least important part of data science . . . I think it would be fair to consider statistics as a subset of data science. . . . it’s not the most important part of data science, or even close.” But then I received “Data Science for Business,” by Foster Provost and Tom Fawcett, in the mail. I might not have opened the book at all (as I’m hardly in the target audience) but for seeing a blurb by Chris Volinsky, a statistician whom I respect a lot. So I flipped through the book and it indeed looked pretty good. It moves slowly but that’s appropriate for an intro book. But what surprised me, given the book’s title and our recent discussion on the nature of data science, was that the book was 100% statistics! It had some math (for example, definitions of various distance measures), some simple algebra, some conceptual graphs such as ROC curve, some tables and graphs of low-dimensional data summaries—but almost

4 0.19990121 195 andrew gelman stats-2010-08-09-President Carter

Introduction: This assessment by Tyler Cowen reminded me that, in 1980, I and just about all my friends hated Jimmy Carter. Most of us much preferred him to Reagan but still hated Carter. I wouldn’t associate this with any particular ideological feeling—it’s not that we thought he was too liberal, or too conservative. He just seemed completely ineffectual. I remember feeling at the time that he had no principles, that he’d do anything to get elected. In retrospect, I think of this as an instance of uniform partisan swing: the president was unpopular nationally, and attitudes about him were negative, relatively speaking, among just about every group. My other Carter story comes from a conversation I had a couple years ago with an economist who’s about my age, a man who said that one reason he and his family moved from town A to town B in his metropolitan area was that, in town B, they didn’t feel like they were the only Republicans on their block. Anyway, this guy described himself as a “

5 0.15347205 1517 andrew gelman stats-2012-10-01-“On Inspiring Students and Being Human”

Introduction: Rachel Schutt (the author of the Taxonomy of Confusion) has a blog! for the course she’s teaching at Columbia, “Introduction to Data Science.” It sounds like a great course—I wish I could take it! Her latest post is “On Inspiring Students and Being Human”: Of course one hopes as a teacher that one will inspire students . . . But what I actually mean by “inspiring students” is that you are inspiring me; you are students who inspire: “inspiring students”. This is one of the happy unintended consequences of this course so far for me. She then gives examples of some of the students in her class and some of their interesting ideas: Phillip is a PhD student in the sociology department . . . He’s in the process of developing his thesis topic around some of the themes we’ve been discussing in this class, such as the emerging data science community. Arvi works at the College Board and is a part time student . . . He analyzes user-level data of students who have signed up f

6 0.13800278 1837 andrew gelman stats-2013-05-03-NYC Data Skeptics Meetup

7 0.13282938 395 andrew gelman stats-2010-11-05-Consulting: how do you figure out what to charge?

8 0.13203132 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning

9 0.12792166 2255 andrew gelman stats-2014-03-19-How Americans vote

10 0.1236013 624 andrew gelman stats-2011-03-22-A question about the economic benefits of universities

11 0.12206996 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

12 0.11865681 1014 andrew gelman stats-2011-11-16-Visualizations of NYPD stop-and-frisk data

13 0.11751144 87 andrew gelman stats-2010-06-15-Statistical analysis and visualization of the drug war in Mexico

14 0.1155593 1832 andrew gelman stats-2013-04-29-The blogroll

15 0.11037358 972 andrew gelman stats-2011-10-25-How do you interpret standard errors from a regression fit to the entire population?

16 0.10975015 1972 andrew gelman stats-2013-08-07-When you’re planning on fitting a model, build up to it by fitting simpler models first. Then, once you have a model you like, check the hell out of it

17 0.10962464 2245 andrew gelman stats-2014-03-12-More on publishing in journals

18 0.10809399 719 andrew gelman stats-2011-05-19-Everything is Obvious (once you know the answer)

19 0.1062805 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

20 0.10603222 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis
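
The page does not say how simValue is computed. A common choice for comparing tf-idf vectors is cosine similarity, and the sketch below assumes that choice; it is not confirmed as this site's method. The first vector uses the top five word weights listed above for this post; the second vector is entirely hypothetical.

```python
import math
from typing import Mapping


def cosine_similarity(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no shared direction if either vector is empty
    return dot / (norm_a * norm_b)


# Top five tf-idf weights for this post, copied from the list above.
this_post = {"town": 0.522, "smallish": 0.209, "schutt": 0.209,
             "rachel": 0.197, "deaths": 0.19}

# Hypothetical weights for another post, for illustration only.
other_post = {"town": 0.4, "deaths": 0.3, "parable": 0.5}

print(cosine_similarity(this_post, this_post))
print(cosine_similarity(this_post, other_post))
```

A post compared with itself scores essentially 1 under this measure, which would be consistent with the same-blog rows scoring close to 1 in the tables, though that alone does not establish the method used.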


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.272), (1, -0.03), (2, -0.042), (3, 0.026), (4, 0.02), (5, 0.028), (6, -0.031), (7, 0.045), (8, 0.034), (9, 0.018), (10, -0.042), (11, -0.02), (12, 0.003), (13, -0.01), (14, 0.015), (15, 0.021), (16, -0.01), (17, -0.02), (18, 0.088), (19, -0.061), (20, 0.017), (21, -0.006), (22, -0.062), (23, -0.006), (24, -0.047), (25, 0.036), (26, -0.04), (27, 0.021), (28, 0.042), (29, 0.073), (30, 0.026), (31, -0.03), (32, -0.043), (33, 0.015), (34, 0.051), (35, 0.115), (36, -0.036), (37, -0.019), (38, -0.008), (39, 0.018), (40, -0.033), (41, -0.067), (42, -0.013), (43, -0.011), (44, 0.003), (45, 0.03), (46, 0.01), (47, 0.031), (48, -0.011), (49, 0.015)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96400011 2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?

2 0.85524809 2184 andrew gelman stats-2014-01-24-Parables vs. stories

3 0.81956601 1276 andrew gelman stats-2012-04-22-“Gross misuse of statistics” can be a good thing, if it indicates the acceptance of the importance of statistical reasoning

Introduction: Rick Lightburn writes: I [Lightburn] am also a member of the group Business Analytics on LinkedIn. I am struck by what I perceive as the gross misuse of statistics by the members of this group, including things that (I thought) were taught in Introductory Statistics courses in business schools. I want to suggest to you that you look at the discussions there if you want examples of such abuse. The discussions there support me in my belief that Analytics is data manipulation in the support of previously developed conclusions. My reply: I don’t think it’s such a bad thing. I like when people make statistical arguments, even bad statistical arguments. Once you accept the concept of arguing from logic and data, maybe you’ll be open to learning something new

4 0.78976285 685 andrew gelman stats-2011-04-29-Data mining and allergies

Introduction: With all this data floating around, there are some interesting analyses one can do. I came across “The Association of Tree Pollen Concentration Peaks and Allergy Medication Sales in New York City: 2003-2008” by Perry Sheffield. There they correlate pollen counts with anti-allergy medicine sales – and indeed find that two days after high pollen counts, the medicine sales are the highest. Of course, it would be interesting to play with the data to see *what* tree is actually causing the sales to increase the most. Perhaps this would help the arborists decide what trees to plant. At the moment they seem to be following a rather sexist approach to tree planting: Ogren says the city could solve the problem by planting only female trees, which don’t produce pollen like male trees do. City arborists shy away from females because many produce messy – or in the case of ginkgos, smelly – fruit that litters sidewalks. In Ogren’s opinion, that’s a mistake. He says the females only pro

5 0.78883034 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis

Introduction: Peter Huber’s most famous work derives from his paper on robust statistics published nearly fifty years ago in which he introduced the concept of M-estimation (a generalization of maximum likelihood) to unify some ideas of Tukey and others for estimation procedures that were relatively insensitive to small departures from the assumed model. Huber has in many ways been ahead of his time. While remaining connected to the theoretical ideas from the early part of his career, his interests have shifted to computational and graphical statistics. I never took Huber’s class on data analysis–he left Harvard while I was still in graduate school–but fortunately I have an opportunity to learn his lessons now, as he has just released a book, “Data Analysis: What Can Be Learned from the Past 50 Years.” The book puts together a few articles published in the past 15 years, along with some new material. Many of the examples are decades old, which is appropriate given that Huber is reviewing f

6 0.78519303 2345 andrew gelman stats-2014-05-24-An interesting mosaic of a data programming course

7 0.78124648 2026 andrew gelman stats-2013-09-16-He’s adult entertainer, Child educator, King of the crossfader, He’s the greatest of the greater, He’s a big bad wolf in your neighborhood, Not bad meaning bad but bad meaning good

8 0.78086609 1023 andrew gelman stats-2011-11-22-Going Beyond the Book: Towards Critical Reading in Statistics Teaching

9 0.77415258 2106 andrew gelman stats-2013-11-19-More on “data science” and “statistics”

10 0.7657693 1722 andrew gelman stats-2013-02-14-Statistics for firefighters: update

11 0.75675964 1525 andrew gelman stats-2012-10-08-Ethical standards in different data communities

12 0.75459296 2115 andrew gelman stats-2013-11-27-Three unblinded mice

13 0.75348443 1541 andrew gelman stats-2012-10-19-Statistical discrimination again

14 0.75244421 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

15 0.75198495 2065 andrew gelman stats-2013-10-17-Cool dynamic demographic maps provide beautiful illustration of Chris Rock effect

16 0.75106484 1212 andrew gelman stats-2012-03-14-Controversy about a ranking of philosophy departments, or How should we think about statistical results when we can’t see the raw data?

17 0.75076997 2307 andrew gelman stats-2014-04-27-Big Data…Big Deal? Maybe, if Used with Caution.

18 0.74991786 203 andrew gelman stats-2010-08-12-John McPhee, the Anti-Malcolm

19 0.74916846 1123 andrew gelman stats-2012-01-17-Big corporations are more popular than you might realize

20 0.74342602 1927 andrew gelman stats-2013-07-05-“Numbersense: How to use big data to your advantage”


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(5, 0.021), (9, 0.025), (10, 0.017), (16, 0.136), (24, 0.102), (27, 0.033), (44, 0.048), (46, 0.041), (48, 0.018), (53, 0.01), (86, 0.052), (98, 0.011), (99, 0.351)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97693789 2106 andrew gelman stats-2013-11-19-More on “data science” and “statistics”

2 0.97605234 722 andrew gelman stats-2011-05-20-Why no Wegmania?

Introduction: A colleague asks: When I search the web, I find the story [of the article by Said, Wegman, et al. on social networks in climate research, which was recently bumped from the journal Computational Statistics and Data Analysis because of plagiarism] only on blogs, USA Today, and UPI. Why is that? Any idea why it isn’t reported by any of the major newspapers? Here’s my answer: 1. USA Today broke the story. Apparently this USA Today reporter put a lot of effort into it. The NYT doesn’t like to run a story that begins, “Yesterday, USA Today reported…” 2. To us it’s big news because we’re statisticians. [The main guy in the study, Edward Wegman, won the Founders Award from the American Statistical Association a few years ago.] To the rest of the world, the story is: “Obscure prof at an obscure college plagiarized an article in a journal that nobody’s ever heard of.” When a Harvard scientist paints black dots on white mice and says he’s curing cancer, that’s news. When P

3 0.97396344 154 andrew gelman stats-2010-07-18-Predictive checks for hierarchical models

Introduction: Daniel Corsi writes: I was wondering if you could help me with some code to set up a posterior predictive check for an unordered multinomial multilevel model. In this case the outcome is categories of BMI (underweight, normal weight, and overweight) based on individuals from 360 different areas. What I would like to do is set up a replicated dataset to see how the number of overweight/underweight/normal weight individuals based on the model compares to the actual data and some kind of a graphical summary. I am following along with chapter 24 of the arm book but I want to verify that the replicated data accounts for the multilevel structure of the data of people within areas. I am attaching the code I used to run a simple model with only 2 predictors (area wealth and urban/rural designation). My reply: The Bugs code is a bit much for me to look at–but I do recommend that you run it from R, which will give you more flexibility in preprocessing and postprocessing the data. Beyon

same-blog 4 0.97308326 2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?

5 0.96985239 2280 andrew gelman stats-2014-04-03-As the boldest experiment in journalism history, you admit you made a mistake

Introduction: The pre-NYT David Brooks liked to make fun of the NYT. Here’s one from 1997 : I’m not sure I’d like to be one of the people featured on the New York Times wedding page, but I know I’d like to be the father of one of them. Imagine how happy Stanley J. Kogan must have been, for example, when his daughter Jamie got into Yale. Then imagine his pride when Jamie made Phi Beta Kappa and graduated summa cum laude. . . . he must have enjoyed a gloat or two when his daughter put on that cap and gown. And things only got better. Jamie breezed through Stanford Law School. And then she met a man—Thomas Arena—who appears to be exactly the sort of son-in-law that pediatric urologists dream about. . . . These two awesome resumes collided at a wedding ceremony . . . It must have been one of the happiest days in Stanley J. Kogan’s life. The rest of us got to read about it on the New York Times wedding page. Brooks is reputed to be Jewish himself so I think it’s ok for him to mock Jewish peop

6 0.96863621 2182 andrew gelman stats-2014-01-22-Spell-checking example demonstrates key aspects of Bayesian data analysis

7 0.9656983 2184 andrew gelman stats-2014-01-24-Parables vs. stories

8 0.9655081 1083 andrew gelman stats-2011-12-26-The quals and the quants

9 0.96547449 571 andrew gelman stats-2011-02-13-A departmental wiki page?

10 0.96531212 110 andrew gelman stats-2010-06-26-Philosophy and the practice of Bayesian statistics

11 0.96491265 54 andrew gelman stats-2010-05-27-Hype about conditional probability puzzles

12 0.96481061 2301 andrew gelman stats-2014-04-22-Ticket to Baaaaarf

13 0.96455455 2107 andrew gelman stats-2013-11-20-NYT (non)-retraction watch

14 0.96447241 2137 andrew gelman stats-2013-12-17-Replication backlash

15 0.96433401 859 andrew gelman stats-2011-08-18-Misunderstanding analysis of covariance

16 0.96370929 1917 andrew gelman stats-2013-06-28-Econ coauthorship update

17 0.96360701 2130 andrew gelman stats-2013-12-11-Multilevel marketing as a way of liquidating participants’ social networks

18 0.96339262 1729 andrew gelman stats-2013-02-20-My beef with Brooks: the alternative to “good statistics” is not “no statistics,” it’s “bad statistics”

19 0.96279919 1656 andrew gelman stats-2013-01-05-Understanding regression models and regression coefficients

20 0.96243638 935 andrew gelman stats-2011-10-01-When should you worry about imputed data?