andrew_gelman_stats-2013-1920: knowledge graph by maker-knowledge-mining

1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools


metadata for this blog post

Source: html

Introduction: Ulrich Atz writes: I regard myself fairly familiar with modern “big data” tools and models such as random forests, SVM etc. However, HyperCube is something I haven’t come across yet (met the marketing guy last week) and they advertise it as “disruptive”, “unique”, “best performing data analysis tool available”. Have you seen it in action? Perhaps performing in any data science style competition? On a side note, they claim it is “non-statistical” which I find absurd. A marketing ploy, but sounds like physics without math. Hence, my question: Do you think there is such a thing as a (1) non-statistical data analysis and (2) non-statistical data set? Here’s what’s on the webpage: The technology is non-statistical, meaning it does not take a sample and use algorithms in order to validate a hypothesis. Instead, it takes input from a large volume of data and outputs the results from the data alone. This means that all the available data is taken into account. I’m not


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Ulrich Atz writes: I regard myself fairly familiar with modern “big data” tools and models such as random forests, SVM etc. [sent-1, score-0.409]

2 However, HyperCube is something I haven’t come across yet (met the marketing guy last week) and they advertise it as “disruptive”, “unique”, “best performing data analysis tool available”. [sent-2, score-0.808]

3 A marketing ploy, but sounds like physics without math. [sent-6, score-0.393]

4 Hence, my question: Do you think there is such a thing as a (1) non-statistical data analysis and (2) non-statistical data set? [sent-7, score-0.48]

5 Here’s what’s on the webpage: The technology is non-statistical, meaning it does not take a sample and use algorithms in order to validate a hypothesis. [sent-8, score-0.64]

6 Instead, it takes input from a large volume of data and outputs the results from the data alone. [sent-9, score-1.044]

7 This means that all the available data is taken into account. [sent-10, score-0.524]

8 I’m not quite sure what’s the difference between “take a sample” and “take input from a large volume of data.” [sent-11, score-0.445]

9 All their examples involve generalizing from their sample data to a population. [sent-12, score-0.586]

10 The webpage continues: The lack of a hypothesis is another advantage of HyperCube over statistics. [sent-14, score-0.258]

11 HyperCube exposes the rules and dependencies that are indicated by the data, and is not tied to any previously held view. [sent-15, score-0.491]

12 Statistics, on the other hand, test data to see whether it proves a specified scenario. [sent-16, score-0.463]

13 The available data are not the point, they are a means to the larger goal of making predictions about future cases. [sent-18, score-0.595]

14 That said, even if the authors of this press material are confused about statistical inference and sampling, the software package could be good. [sent-19, score-0.235]
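
The sentScore values above come from a tf-idf weighting of each sentence's words. As a rough illustration, here is a minimal Python sketch of one common way to build such an extractive summary: score each sentence by the average tf-idf weight of its terms and keep the top-ranked ones. The scoring rule and function names are assumptions; the actual maker-knowledge-mining pipeline may differ.

```python
# A minimal sketch of tf-idf extractive summarization (assumed scoring
# rule: mean tf-idf weight per sentence; the real pipeline may differ).
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(sentences, top_n=5):
    """Rank sentences by the mean tf-idf weight of their terms."""
    weights = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Sum of weights per sentence divided by its number of non-zero terms.
    scores = weights.sum(axis=1).A1 / (weights.getnnz(axis=1) + 1e-9)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [(i, round(float(scores[i]), 3)) for i in ranked[:top_n]]
```

Under a scheme like this, sentNum would be a sentence's original position in the post and sentScore its tf-idf score.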


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('hypercube', 0.442), ('data', 0.24), ('marketing', 0.188), ('webpage', 0.185), ('input', 0.181), ('available', 0.178), ('volume', 0.176), ('performing', 0.176), ('sample', 0.149), ('svm', 0.147), ('forests', 0.147), ('sounds', 0.132), ('dependencies', 0.128), ('advertise', 0.128), ('proves', 0.128), ('outputs', 0.119), ('validate', 0.116), ('misses', 0.114), ('take', 0.113), ('generalizing', 0.112), ('means', 0.106), ('regard', 0.104), ('indicated', 0.099), ('specified', 0.095), ('tied', 0.093), ('previously', 0.089), ('technology', 0.088), ('large', 0.088), ('algorithms', 0.088), ('action', 0.088), ('competition', 0.087), ('meaning', 0.086), ('fairly', 0.086), ('involve', 0.085), ('confused', 0.082), ('held', 0.082), ('met', 0.082), ('unique', 0.078), ('package', 0.077), ('tool', 0.076), ('press', 0.076), ('modern', 0.074), ('style', 0.073), ('advantage', 0.073), ('physics', 0.073), ('tools', 0.073), ('familiar', 0.072), ('continues', 0.072), ('predictions', 0.071), ('hence', 0.071)]
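
The word weights above are the post's top tf-idf terms. A hedged sketch of how a "similar blogs" ranking can be computed from such vectors: represent every post as a tf-idf vector and rank the others by cosine similarity. The helper below is illustrative, not the actual pipeline.

```python
# Illustrative sketch: rank posts by cosine similarity of tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_posts(posts, query_index, top_n=20):
    """Return (index, similarity) pairs closest to posts[query_index]."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
    sims = cosine_similarity(tfidf[query_index], tfidf).ravel()
    order = sims.argsort()[::-1]  # the post itself comes first with sim 1.0
    return [(int(i), round(float(sims[i]), 3)) for i in order[:top_n]]
```

This would be consistent with the "same-blog" row at the top of each list below carrying a simValue of 1.0 or close to it.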

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools


2 0.11725719 1605 andrew gelman stats-2012-12-04-Write This Book

Introduction: This post is by Phil Price. I’ve been preparing a review of a new statistics textbook aimed at students and practitioners in the “physical sciences,” as distinct from the social sciences and also distinct from people who intend to take more statistics courses. I figured that since it’s been years since I looked at an intro stats textbook, I should look at a few others and see how they differ from this one, so in addition to the book I’m reviewing I’ve looked at some other textbooks aimed at similar audiences: Milton and Arnold; Hines, Montgomery, Goldsman, and Borror; and a few others. I also looked at the table of contents of several more. There is a lot of overlap in the coverage of these books — they all have discussions of common discrete and continuous distributions, joint distributions, descriptive statistics, parameter estimation, hypothesis testing, linear regression, ANOVA, factorial experimental design, and a few other topics. I can see how, from a statisti

3 0.11698046 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

Introduction: I’ve been writing a lot about my philosophy of Bayesian statistics and how it fits into Popper’s ideas about falsification and Kuhn’s ideas about scientific revolutions. Here’s my long, somewhat technical paper with Cosma Shalizi. Here’s our shorter overview for the volume on the philosophy of social science. Here’s my latest try (for an online symposium), focusing on the key issues. I’m pretty happy with my approach–the familiar idea that Bayesian data analysis iterates the three steps of model building, inference, and model checking–but it does have some unresolved (maybe unresolvable) problems. Here are a couple mentioned in the third of the above links. Consider a simple model with independent data y_1, y_2, …, y_10 ~ N(θ, σ^2), with a prior distribution θ ~ N(0, 10^2) and σ known and taking on some value of approximately 10. Inference about θ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewn

4 0.11635966 1628 andrew gelman stats-2012-12-17-Statistics in a world where nothing is random

Introduction: Rama Ganesan writes: I think I am having an existential crisis. I used to work with animals (rats, mice, gerbils, etc.). Then I started to work in marketing research where we did have some kind of random sampling procedure. So up until a few years ago, I was sort of okay. Now I am teaching marketing research, and I feel like there is no real random sampling anymore. I take pains to get students to understand what random means, and then the whole lot of inferential statistics. Then almost anything they do – the sample is not random. They think I am contradicting myself. They use convenience samples at every turn – for their school work, and the enormous amount of online surveying that gets done. Do you have any suggestions for me? Other than, say, something like this. My reply: Statistics does not require randomness. The three essential elements of statistics are measurement, comparison, and variation. Randomness is one way to supply variation, and it’s one way to model

5 0.1044268 1950 andrew gelman stats-2013-07-22-My talks that were scheduled for Tues at the Data Skeptics meetup and Wed at the Open Statistical Programming meetup

Introduction: Statistical Methods and Data Skepticism: Data analysis today is dominated by three paradigms: null hypothesis significance testing, Bayesian inference, and exploratory data analysis. There is concern that all these methods lead to overconfidence on the part of researchers and the general public, and this concern has led to the new “data skepticism” movement. But the history of statistics is already in some sense a history of data skepticism. Concepts of bias, variance, sampling and measurement error, least-squares regression, and statistical significance can all be viewed as formalizations of data skepticism. All these methods address the concern that patterns in observed data might not generalize to the population of interest. We discuss the challenge of attaining data skepticism while avoiding data nihilism, and consider some proposed future directions. Stan: Stan (mc-stan.org) is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a

6 0.10279909 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

7 0.10071553 1934 andrew gelman stats-2013-07-11-Yes, worry about generalizing from data to population. But multilevel modeling is the solution, not the problem

8 0.096110053 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

9 0.095929213 1506 andrew gelman stats-2012-09-21-Building a regression model . . . with only 27 data points

10 0.095031388 304 andrew gelman stats-2010-09-29-Data visualization marathon

11 0.094164751 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning

12 0.093762271 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes

13 0.093297601 2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?

14 0.092847809 774 andrew gelman stats-2011-06-20-The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing

15 0.090210877 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

16 0.089327939 2359 andrew gelman stats-2014-06-04-All the Assumptions That Are My Life

17 0.088648714 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis

18 0.087312475 2263 andrew gelman stats-2014-03-24-Empirical implications of Empirical Implications of Theoretical Models

19 0.087041289 490 andrew gelman stats-2010-12-29-Brain Structure and the Big Five

20 0.085789979 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.18), (1, 0.046), (2, -0.017), (3, -0.04), (4, 0.044), (5, 0.031), (6, -0.068), (7, -0.002), (8, -0.003), (9, -0.005), (10, -0.022), (11, -0.027), (12, -0.006), (13, -0.038), (14, -0.028), (15, 0.002), (16, -0.008), (17, -0.039), (18, 0.038), (19, -0.033), (20, 0.006), (21, 0.006), (22, -0.021), (23, 0.017), (24, -0.053), (25, 0.001), (26, 0.001), (27, 0.008), (28, 0.077), (29, 0.015), (30, 0.035), (31, -0.043), (32, 0.003), (33, 0.025), (34, 0.001), (35, 0.109), (36, -0.045), (37, -0.004), (38, -0.022), (39, 0.058), (40, 0.008), (41, -0.01), (42, -0.014), (43, 0.028), (44, -0.033), (45, 0.025), (46, -0.004), (47, -0.046), (48, 0.024), (49, -0.065)]
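
The 50 (topicId, topicWeight) pairs above are the post's coordinates in a latent semantic space. Assuming "lsi" here means latent semantic indexing, i.e. a truncated SVD of the tf-idf matrix, a minimal sketch looks like this; the function name and the 50-topic setting are inferred from the output, not confirmed.

```python
# Minimal LSI sketch: truncated SVD of the tf-idf matrix, then cosine
# similarity in the 50-dimensional latent space (50 inferred from above).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsi_similarities(posts, query_index, n_topics=50):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)
    latent = TruncatedSVD(n_components=n_topics).fit_transform(tfidf)
    # Each row of `latent` is a topic-weight vector like the one listed above.
    return cosine_similarity(latent[query_index:query_index + 1], latent).ravel()
```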

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97280514 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools


2 0.84668243 1837 andrew gelman stats-2013-05-03-NYC Data Skeptics Meetup

Introduction: Rachel Schutt writes: The hype surrounding Big Data and Data Science is at a fever pitch with promises to solve the world’s business and social problems, large and small. How accurate or misleading is this message? How is it helping or damaging people, and which people? What opportunities exist for data nerds and entrepreneurs that examine the larger issues with a skeptical view? This Meetup focuses on mathematical, ethical, and business aspects of data from a skeptical perspective. Guest speakers will discuss the misuse of and best practices with data, common mistakes people make with data and ways to avoid them, how to deal with intentional gaming and politics surrounding mathematical modeling, and taking into account the feedback loops and wider consequences of modeling. We will take deep dives into models in the fields of Data Science, statistics, financial engineering, and economics. This is an independent forum and open to anyone sharing an interest in the larger use of

3 0.84065336 946 andrew gelman stats-2011-10-07-Analysis of Power Law of Participation

Introduction: Rick Wash writes: A colleague at USC (Lian Jian) and I were recently discussing a statistical analysis issue that both of us have run into recently. We both mostly do research about how people use online interactive websites. One property that most of these systems have is known as the “power law of participation”: the distribution of the number of contributions from each person follows a power law. This means that a few people contribute a TON and many, many people are in the “long tail” and contribute very rarely. For example, Facebook posts and twitter posts both have this distribution, as do comments on blogs and many other forms of user contribution online. This distribution has proven to be a problem when we analyze individual behavior. The basic problem is that we’d like to account for the fact that we have repeated data from many users, but a large number of users only have 1 or 2 data points. For example, Lian recently analyzed data about monetary contributions

4 0.83429903 1853 andrew gelman stats-2013-05-12-OpenData Latinoamerica

Introduction: Miguel Paz writes: Poderomedia Foundation and PinLatam are launching OpenDataLatinoamerica.org, a regional data repository to free data and use it on Hackathons and other activities by HacksHackers chapters and other organizations. We are doing this because the road to the future of news has been littered with lost datasets. A day or so after every hackathon and meeting where a group has come together to analyze, compare and understand a particular set of data, someone tries to remember where the successful files were stored. Too often, no one is certain. Therefore with Mariano Blejman we realized that we need a central repository where you can share the data that you have proved to be reliable: OpenData Latinoamerica, which we are leading as ICFJ Knight International Journalism Fellows. If you work in Latin America or Central America your organization can take part in OpenDataLatinoamerica.org. To apply, go to the website and answer a simple form agreeing to meet the standard

5 0.83270812 192 andrew gelman stats-2010-08-08-Turning pages into data

Introduction: There is a lot of data on the web, meant to be looked at by people, but how do you turn it into a spreadsheet people could actually analyze statistically? The technique to turn web pages intended for people into structured data sets intended for computers is called “screen scraping.” It has just been made easier with a wiki/community, http://scraperwiki.com/. They provide libraries to extract information from PDF and Excel files, to automatically fill in forms, and similar. Moreover, the community aspect of it should allow researchers doing similar things to get connected. It’s very good. Here’s an example of scraping road accident data or port of London ship arrivals. You can already find collections of structured data online; examples are Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). There’s also a repository system for data, TheData (“An open-source application for pub

6 0.82122147 714 andrew gelman stats-2011-05-16-NYT Labs releases Openpaths, a utility for saving your iphone data

7 0.82114249 2307 andrew gelman stats-2014-04-27-Big Data…Big Deal? Maybe, if Used with Caution.

8 0.81924736 1212 andrew gelman stats-2012-03-14-Controversy about a ranking of philosophy departments, or How should we think about statistical results when we can’t see the raw data?

9 0.81694525 215 andrew gelman stats-2010-08-18-DataMarket

10 0.81474429 1289 andrew gelman stats-2012-04-29-We go to war with the data we have, not the data we want

11 0.81407475 1447 andrew gelman stats-2012-08-07-Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

12 0.80579108 2345 andrew gelman stats-2014-05-24-An interesting mosaic of a data programming course

13 0.80343354 2084 andrew gelman stats-2013-11-01-Doing Data Science: What’s it all about?

14 0.79926145 176 andrew gelman stats-2010-08-02-Information is good

15 0.79017556 298 andrew gelman stats-2010-09-27-Who is that masked person: The use of face masks on Mexico City public transportation during the Influenza A (H1N1) outbreak

16 0.79012614 1178 andrew gelman stats-2012-02-21-How many data points do you really have?

17 0.78733176 544 andrew gelman stats-2011-01-29-Splitting the data

18 0.78071946 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis

19 0.77930689 1175 andrew gelman stats-2012-02-19-Factual – a new place to find data

20 0.77693617 1990 andrew gelman stats-2013-08-20-Job opening at an organization that promotes reproducible research!


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(4, 0.013), (9, 0.02), (15, 0.025), (16, 0.055), (21, 0.034), (24, 0.175), (43, 0.198), (54, 0.014), (86, 0.04), (95, 0.021), (99, 0.305)]
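
The sparse (topicId, topicWeight) pairs above read like a per-document LDA topic distribution with small weights dropped. A sketch under that assumption follows; the topic count and the 0.01 cutoff are guesses chosen to match the output format, not documented values.

```python
# Sketch of per-document LDA topic weights; n_topics and the cutoff are
# assumptions chosen to match the sparse output format above.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topic_weights(posts, query_index, n_topics=100, cutoff=0.01):
    counts = CountVectorizer(stop_words="english").fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # each row sums to 1
    return [(t, round(float(w), 3))
            for t, w in enumerate(doc_topics[query_index]) if w > cutoff]
```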

similar blogs list:

simIndex simValue blogId blogTitle

1 0.98297691 314 andrew gelman stats-2010-10-03-Disconnect between drug and medical device approval

Introduction: Sanjay Kaul writes: By statute (“the least burdensome” pathway), the approval standard for devices by the US FDA is lower than for drugs. Before a new drug can be marketed, the sponsor must show “substantial evidence of effectiveness” as based on two or more well-controlled clinical studies (which literally means 2 trials, each with a p value of <0.05, or 1 large trial with a robust p value <0.00125). In contrast, the sponsor of a new device, especially those that are designated as high-risk (Class III) devices, need only demonstrate “substantial equivalence” to an FDA-approved device via the 510(k) exemption or a “reasonable assurance of safety and effectiveness”, evaluated through a pre-market approval and typically based on a single study. What does “reasonable assurance” or “substantial equivalence” imply to you as a Bayesian? These are obviously qualitative constructs, but if one were to quantify them, how would you go about addressing it? The regulatory definitions for

2 0.97415721 1077 andrew gelman stats-2011-12-21-In which I compare “POLITICO’s chief political columnist” unfavorably to a cranky old dead guy and one of the funniest writers who’s ever lived

Introduction: Neil Malhotra writes: I just wanted to alert you to this completely misinformed Politico article by Roger Simon, equating sampling theory with “magic.” Normally, I wouldn’t send you this, but I sent him a helpful email and he was a complete jerk about it. Wow—this is really bad. It’s so bad I refuse to link to it. I don’t know who this dude is, but it’s pitiful. Andy Rooney could do better. And I don’t mean Andy Rooney in his prime, I mean Andy Rooney right now. The piece appears to be an attempt at jocularity, but it’s about 10 million times worse than whatever the worst thing is that Dave Barry has ever written. My question to Neil Malhotra is . . . what made you click on this in the first place? P.S. John Sides piles on with some Gallup quotes.

3 0.96378791 1707 andrew gelman stats-2013-02-05-Glenn Hubbard and I were on opposite sides of a court case and I didn’t even know it!

Introduction: Matt Taibbi writes: Glenn Hubbard, Leading Academic and Mitt Romney Advisor, Took $1200 an Hour to Be Countrywide’s Expert Witness . . . Hidden among the reams of material recently filed in connection with the lawsuit of monoline insurer MBIA against Bank of America and Countrywide is a deposition of none other than Columbia University’s Glenn Hubbard. . . . Hubbard testified on behalf of Countrywide in the MBIA suit. He conducted an “analysis” that essentially concluded that Countrywide’s loans weren’t any worse than the loans produced by other mortgage originators, and that therefore the monstrous losses that investors in those loans suffered were due to other factors related to the economic crisis – and not caused by the serial misrepresentations and fraud in Countrywide’s underwriting. That’s interesting, because I worked on the other side of this case! I was hired by MBIA’s lawyers. It wouldn’t be polite of me to reveal my consulting rate, and I never actually got depose

4 0.95775002 1754 andrew gelman stats-2013-03-08-Cool GSS training video! And cumulative file 1972-2012!

Introduction: Felipe Osorio made the above video to help people use the General Social Survey and R to answer research questions in social science. Go for it! Meanwhile, Tom Smith reports: The initial release of the General Social Survey (GSS) cumulative file for 1972-2012 is now on our website. Codebooks and copies of questionnaires will be posted shortly. Later, additional files, including the GSS reinterview panels and additional variables in the cumulative file, will be added. P.S. R scripts are here.

5 0.9475081 1347 andrew gelman stats-2012-05-27-Macromuddle

Introduction: More and more I feel like economics reporting is based on crude principles of adding up “good news” and “bad news.” Sometimes this makes sense: by almost any measure, an unemployment rate of 10% is bad news compared to an unemployment rate of 5%. Other times, though, the good/bad news framework seems so tangled. For example: house prices up is considered good news but inflation is considered bad news. A strong dollar is considered good news but it’s also an unfavorable exchange rate, which is bad news. When facebook shares go down, that’s bad news, but if they automatically go up, that means they were underpriced which doesn’t seem so good either. Pundits are torn between rooting for the euro to fail (which means our team (the U.S.) is better than Europe (their team)) and rooting for it to survive (because a collapse in Europe is bad news for the U.S. economy). China’s economy doing well is bad news—but if their economy slips, that’s bad news too. I think you get the picture

6 0.94239587 857 andrew gelman stats-2011-08-17-Bayes pays

7 0.93514609 601 andrew gelman stats-2011-03-05-Against double-blind reviewing: Political science and statistics are not like biology and physics

8 0.93293941 1253 andrew gelman stats-2012-04-08-Technology speedup graph

same-blog 9 0.92494655 1920 andrew gelman stats-2013-06-30-“Non-statistical” statistics tools

10 0.92097998 2330 andrew gelman stats-2014-05-12-Historical Arc of Universities

11 0.91669947 1882 andrew gelman stats-2013-06-03-The statistical properties of smart chains (and referral chains more generally)

12 0.91607225 538 andrew gelman stats-2011-01-25-Postdoc Position #2: Hierarchical Modeling and Statistical Graphics

13 0.91338331 1860 andrew gelman stats-2013-05-17-How can statisticians help psychologists do their research better?

14 0.91166723 70 andrew gelman stats-2010-06-07-Mister P goes on a date

15 0.91099966 806 andrew gelman stats-2011-07-17-6 links

16 0.90801573 481 andrew gelman stats-2010-12-22-The Jumpstart financial literacy survey and the different purposes of tests

17 0.89924043 770 andrew gelman stats-2011-06-15-Still more Mr. P in public health

18 0.89596868 22 andrew gelman stats-2010-05-07-Jenny Davidson wins Mark Van Doren Award, also some reflections on the continuity of work within literary criticism or statistics

19 0.89200526 75 andrew gelman stats-2010-06-08-“Is the cyber mob a threat to freedom?”

20 0.89138842 1956 andrew gelman stats-2013-07-25-What should be in a machine learning course?