andrew_gelman_stats andrew_gelman_stats-2013 andrew_gelman_stats-2013-1718 knowledge-graph by maker-knowledge-mining

1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building


meta info for this blog

Source: html

Introduction: Patrick Caldon writes: I saw your recent blog post where you discussed in passing an iterative-chain-of models approach to AI. I essentially built such a thing for my PhD thesis – not in a Bayesian context, but in a logic programming context – and proved it had a few properties and showed how you could solve some toy problems. The important bit of my framework was that at various points you also go and get more data in the process – in a statistical context this might be seen as building a little univariate model on a subset of the data, then iteratively extending into a better model with more data and more independent variables – a generalized forward stepwise regression if you like. It wrapped a proper computational framework around E.M. Gold’s identification/learning in the limit based on a logic my advisor (Eric Martin) had invented. What’s not written up in the thesis is a few months of failed struggle trying to shoehorn some simple statistical inference into this framework with decent computational properties!


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Patrick Caldon writes: I saw your recent blog post where you discussed in passing an iterative-chain-of models approach to AI. [sent-1, score-0.094]

2 I essentially built such a thing for my PhD thesis – not in a Bayesian context, but in a logic programming context – and proved it had a few properties and showed how you could solve some toy problems. [sent-2, score-1.193]

3 It wrapped a proper computational framework around E.M. Gold’s identification/learning in the limit based on a logic my advisor (Eric Martin) had invented. [sent-4, score-0.714] [sent-6, score-0.538]

4 What’s not written up in the thesis is a few months of failed struggle trying to shoehorn some simple statistical inference into this framework with decent computational properties! [sent-7, score-1.077]

5 I had a good crack with a few different ideas and didn’t really get anywhere, and worse I couldn’t say much in the end about why it seemed to be hard. [sent-8, score-0.199]

6 I’ve now moved on to different things (indeed, moved on from logic in academia into statistics in finance) but I thought you might find it interesting to see this problem analysed from a different perspective. [sent-10, score-1.029]
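The sentence list above is an extractive summary scored by a tfidf model. The exact recipe used here isn't documented on this page, but a minimal sketch of the standard approach — score each sentence by the summed tf-idf weight of its words, then sort — might look like the following (the tokenizer and weights are purely illustrative, not the pipeline behind this page):

```python
def score_sentences(sentences, tfidf):
    """Rank sentences by the total tf-idf weight of their words,
    the usual recipe behind an extractive 'most important sentences' list."""
    scored = []
    for idx, sent in enumerate(sentences, start=1):
        words = [w.strip(".,!?()").lower() for w in sent.split()]
        score = sum(tfidf.get(w, 0.0) for w in words)
        scored.append((idx, score, sent))
    return sorted(scored, key=lambda t: -t[1])

# Illustrative weights, in the spirit of the topN-words list on this page.
tfidf = {"framework": 0.352, "logic": 0.266, "thesis": 0.165, "toy": 0.105}
sentences = [
    "I saw your recent blog post.",
    "It wrapped a proper computational framework around a logic.",
    "My thesis solved some toy problems.",
]
for idx, score, sent in score_sentences(sentences, tfidf):
    print(idx, round(score, 3), sent)
```

Sentences dense in high-weight terms float to the top, which is why the framework/logic sentences above carry the largest sentScore values.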


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('framework', 0.352), ('logic', 0.266), ('straightforward', 0.2), ('gold', 0.197), ('tree', 0.174), ('properties', 0.174), ('thesis', 0.165), ('context', 0.164), ('limit', 0.158), ('moved', 0.14), ('computational', 0.139), ('valued', 0.138), ('iteratively', 0.138), ('analysed', 0.138), ('iff', 0.138), ('shoehorn', 0.138), ('wrapped', 0.138), ('cart', 0.13), ('stepwise', 0.12), ('crack', 0.117), ('advisor', 0.114), ('identifiable', 0.111), ('computationally', 0.109), ('decent', 0.107), ('univariate', 0.107), ('patrick', 0.105), ('toy', 0.105), ('extending', 0.1), ('finance', 0.097), ('terrible', 0.097), ('struggle', 0.097), ('implementing', 0.096), ('passing', 0.094), ('onto', 0.091), ('academia', 0.09), ('anywhere', 0.09), ('martin', 0.088), ('proved', 0.087), ('subset', 0.086), ('phd', 0.085), ('proper', 0.085), ('generalized', 0.085), ('different', 0.082), ('built', 0.081), ('failed', 0.079), ('eric', 0.079), ('programming', 0.076), ('collection', 0.076), ('showed', 0.075), ('data', 0.075)]
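The (word, weight) pairs above are per-word tf-idf scores for this post. As a rough sketch of how such weights arise — using the common tf·log(N/df) scheme with L2 normalization, on a toy corpus that is purely illustrative — one could write:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """Normalized tf-idf vector for one document.

    Uses the common scheme w = tf * log(N / df), then L2-normalizes
    so weights are comparable across documents.
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each word
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_tokens)            # term frequency within this document
    weights = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {w: v / norm for w, v in weights.items()}

corpus = [
    ["framework", "logic", "model", "data"],
    ["model", "data", "regression"],
    ["logic", "framework", "thesis", "framework"],
]
vec = tfidf_vector(corpus[2], corpus)
print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

Words frequent in the document but rare in the corpus (here "thesis") get the highest weights, which is why distinctive terms like "shoehorn" and "iff" rank so highly above.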

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building

2 0.11960936 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

Introduction: Elias Bareinboim asked what I thought about his comment on selection bias, in which he referred to a paper by himself and Judea Pearl, “Controlling Selection Bias in Causal Inference.” I replied that I have no problem with what he wrote, but that from my perspective I find it easier to conceptualize such problems in terms of multilevel models. I elaborated on that point in a recent post , “Hierarchical modeling as a framework for extrapolation,” which I think was read by only a few people (I say this because it received only two comments). I don’t think Bareinboim objected to anything I wrote, but like me he is comfortable working within his own framework. He wrote the following to me: In some sense, “not ad hoc” could mean logically consistent. In other words, if one agrees with the assumptions encoded in the model, one must also agree with the conclusions entailed by these assumptions. I am not aware of any other way of doing mathematics. As it turns out, to get causa

3 0.10643578 2357 andrew gelman stats-2014-06-02-Why we hate stepwise regression

Introduction: Haynes Goddard writes: I have been slowly working my way through the grad program in stats here, and the latest course was a biostats course on categorical and survival analysis. I noticed in the semi-parametric and parametric material (Wang and Lee is the text) that they use stepwise regression a lot. I learned in econometrics that stepwise is poor practice, as it defaults to the “theory of the regression line”, that is no theory at all, just the variation in the data. I don’t find the topic on your blog, and wonder if you have addressed the issue. My reply: Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once. To address the issue more directly: the motivation behind stepwise regression is that you have a lot of potential predictors but not e

4 0.098613091 524 andrew gelman stats-2011-01-19-Data exploration and multiple comparisons

Introduction: Bill Harris writes: I’ve read your paper and presentation showing why you don’t usually worry about multiple comparisons. I see how that applies when you are comparing results across multiple settings (states, etc.). Does the same principle hold when you are exploring data to find interesting relationships? For example, you have some data, and you’re trying a series of models to see which gives you the most useful insight. Do you try your models on a subset of the data so you have another subset for confirmatory analysis later, or do you simply throw all the data against your models? My reply: I’d like to estimate all the relationships at once and use a multilevel model to do partial pooling to handle the multiplicity issues. That said, in practice, in my applied work I’m always bouncing back and forth between different hypotheses and different datasets, and often I learn a lot when next year’s data come in and I can modify my hypotheses. The trouble with the classical

5 0.096651159 1469 andrew gelman stats-2012-08-25-Ways of knowing

Introduction: In this discussion from last month, computer science student and Judea Pearl collaborator Elias Barenboim expressed an attitude that hierarchical Bayesian methods might be fine in practice but that they lack theory, that Bayesians can’t succeed in toy problems. I posted a P.S. there which might not have been noticed so I will put it here: I now realize that there is some disagreement about what constitutes a “guarantee.” In one of his comments, Barenboim writes, “the assurance we have that the result must hold as long as the assumptions in the model are correct should be regarded as a guarantee.” In that sense, yes, we have guarantees! It is fundamental to Bayesian inference that the result must hold if the assumptions in the model are correct. We have lots of that in Bayesian Data Analysis (particularly in the first four chapters but implicitly elsewhere as well), and this is also covered in the classic books by Lindley, Jaynes, and others. This sort of guarantee is indeed p

6 0.095543668 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning

7 0.094040014 1205 andrew gelman stats-2012-03-09-Coming to agreement on philosophy of statistics

8 0.093387268 1948 andrew gelman stats-2013-07-21-Bayes related

9 0.091564469 781 andrew gelman stats-2011-06-28-The holes in my philosophy of Bayesian data analysis

10 0.087139688 774 andrew gelman stats-2011-06-20-The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing

11 0.085871056 1383 andrew gelman stats-2012-06-18-Hierarchical modeling as a framework for extrapolation

12 0.084434181 2173 andrew gelman stats-2014-01-15-Postdoc involving pathbreaking work in MRP, Stan, and the 2014 election!

13 0.08156047 214 andrew gelman stats-2010-08-17-Probability-processing hardware

14 0.081174426 1740 andrew gelman stats-2013-02-26-“Is machine learning a subset of statistics?”

15 0.079507485 353 andrew gelman stats-2010-10-19-The violent crime rate was about 75% higher in Detroit than in Minneapolis in 2009

16 0.079248898 901 andrew gelman stats-2011-09-12-Some thoughts on academic cheating, inspired by Frey, Wegman, Fischer, Hauser, Stapel

17 0.078339085 1739 andrew gelman stats-2013-02-26-An AI can build and try out statistical models using an open-ended generative grammar

18 0.077584878 1535 andrew gelman stats-2012-10-16-Bayesian analogue to stepwise regression?

19 0.07740555 458 andrew gelman stats-2010-12-08-Blogging: Is it “fair use”?

20 0.076161571 1695 andrew gelman stats-2013-01-28-Economists argue about Bayes
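The simValue column in the list above can be read as a cosine similarity between tf-idf vectors, which is why the same-blog row scores essentially 1.0. A minimal sketch over sparse word-weight dictionaries (the vectors here are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors (dicts word -> weight)."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up weight vectors, in the spirit of the topN-words list above.
a = {"framework": 0.35, "logic": 0.27, "stepwise": 0.12}
b = {"stepwise": 0.40, "regression": 0.30, "logic": 0.10}

print(cosine(a, a))  # a post compared with itself: ~1.0, like the same-blog row
print(cosine(a, b))  # partial word overlap gives a modest positive score
```

Because tf-idf vectors have no negative weights, these scores fall in [0, 1], matching the range of simValues shown above.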


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.16), (1, 0.057), (2, -0.046), (3, 0.032), (4, 0.015), (5, 0.001), (6, -0.035), (7, 0.001), (8, 0.056), (9, 0.013), (10, 0.0), (11, 0.02), (12, -0.012), (13, -0.011), (14, -0.008), (15, 0.016), (16, 0.013), (17, -0.015), (18, -0.007), (19, 0.012), (20, 0.001), (21, -0.001), (22, -0.023), (23, 0.023), (24, 0.019), (25, 0.026), (26, 0.006), (27, -0.048), (28, 0.01), (29, 0.007), (30, 0.057), (31, 0.0), (32, -0.011), (33, -0.0), (34, 0.024), (35, 0.003), (36, -0.008), (37, -0.015), (38, -0.005), (39, -0.001), (40, 0.01), (41, 0.034), (42, 0.016), (43, 0.008), (44, 0.001), (45, -0.007), (46, 0.016), (47, 0.023), (48, -0.002), (49, -0.044)]
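The (topicId, topicWeight) pairs above come from an LSI model, which is a truncated SVD of the term-document matrix: each document is projected onto the k directions with the largest singular values, and documents are then compared in that low-dimensional topic space. A sketch under those assumptions (the matrix and k are toy values, not this corpus):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); the values
# are made-up tf-idf weights, not the real corpus behind this page.
X = np.array([
    [0.9, 0.1, 0.0, 0.8],
    [0.8, 0.0, 0.1, 0.7],
    [0.0, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.0],
])

# LSI = truncated SVD: keep only the k largest singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T  # one row of k topic weights per document

def cos(u, v):
    """Cosine similarity between two topic-weight vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 3 use the same terms; documents 0 and 1 mostly do not.
print(cos(doc_topics[0], doc_topics[3]))  # high
print(cos(doc_topics[0], doc_topics[1]))  # low
```

Compared with raw tf-idf cosine, the SVD step lets two posts score as similar even when they use different words for the same topic.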

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9575398 1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building

Introduction: Patrick Caldon writes: I saw your recent blog post where you discussed in passing an iterative-chain-of models approach to AI. I essentially built such a thing for my PhD thesis – not in a Bayesian context, but in a logic programming context – and proved it had a few properties and showed how you could solve some toy problems. The important bit of my framework was that at various points you also go and get more data in the process – in a statistical context this might be seen as building a little univariate model on a subset of the data, then iteratively extending into a better model with more data and more independent variables – a generalized forward stepwise regression if you like. It wrapped a proper computational framework around E.M. Gold’s identification/learning in the limit based on a logic my advisor (Eric Martin) had invented. What’s not written up in the thesis is a few months of failed struggle trying to shoehorn some simple statistical inference into this

2 0.82266033 421 andrew gelman stats-2010-11-19-Just chaid

Introduction: Reading somebody else’s statistics rant made me realize the inherent contradictions in much of my own statistical advice. Jeff Lax sent along this article by Philip Schrodt, along with the cryptic comment: Perhaps of interest to you. Perhaps not. Not meant to be an excuse for you to rant against hypothesis testing again. In his article, Schrodt makes a reasonable and entertaining argument against the overfitting of data and the overuse of linear models. He states that his article is motivated by the quantitative papers he has been sent to review for journals or conferences, and he explicitly excludes “studies of United States voting behavior,” so at least I think Mister P is off the hook. I notice a bit of incoherence in Schrodt’s position–on one hand, he criticizes “kitchen-sink models” for overfitting and he criticizes “using complex methods without understanding the underlying assumptions” . . . but then later on he suggests that political scientists in this countr

3 0.81715369 1482 andrew gelman stats-2012-09-04-Model checking and model understanding in machine learning

Introduction: Last month I wrote : Computer scientists are often brilliant but they can be unfamiliar with what is done in the worlds of data collection and analysis. This goes the other way too: statisticians such as myself can look pretty awkward, reinventing (or failing to reinvent) various wheels when we write computer programs or, even worse, try to design software. Andrew MacNamara followed up with some thoughts: I [MacNamara] had some basic statistics training through my MBA program, after having completed an undergrad degree in computer science. Since then I’ve been very interested in learning more about statistical techniques, including things like GLM and censored data analyses as well as machine learning topics like neural nets, SVMs, etc. I began following your blog after some research into Bayesian analysis topics and I am trying to dig deeper on that side of things. One thing I have noticed is that there seems to be a distinction between data analysi

4 0.81110859 244 andrew gelman stats-2010-08-30-Useful models, model checking, and external validation: a mini-discussion

Introduction: I sent a copy of my paper (coauthored with Cosma Shalizi) on Philosophy and the practice of Bayesian statistics in the social sciences to Richard Berk , who wrote: I read your paper this morning. I think we are pretty much on the same page about all models being wrong. I like very much the way you handle this in the paper. Yes, Newton’s work is wrong, but surely useful. I also like your twist on Bayesian methods. Makes good sense to me. Perhaps most important, your paper raises some difficult issues I have been trying to think more carefully about. 1. If the goal of a model is to be useful, surely we need to explore what “useful” means. At the very least, usefulness will depend on use. So a model that is useful for forecasting may or may not be useful for causal inference. 2. Usefulness will be a matter of degree. So that for each use we will need one or more metrics to represent how useful the model is. In what looks at first to be simple example, if the use is forecasting,

5 0.78153282 1739 andrew gelman stats-2013-02-26-An AI can build and try out statistical models using an open-ended generative grammar

Introduction: David Duvenaud writes: I’ve been following your recent discussions about how an AI could do statistics [see also here ]. I was especially excited about your suggestion for new statistical methods using “a language-like approach to recursively creating new models from a specified list of distributions and transformations, and an automatic approach to checking model fit.” Your discussion of these ideas was exciting to me and my colleagues because we recently did some work taking a step in this direction, automatically searching through a grammar over Gaussian process regression models. Roger Grosse previously did the same thing , but over matrix decomposition models using held-out predictive likelihood to check model fit. These are both examples of automatic Bayesian model-building by a search over more and more complex models, as you suggested. One nice thing is that both grammars include lots of standard models for free, and they seem to work pretty well, although the

6 0.77575994 1769 andrew gelman stats-2013-03-18-Tibshirani announces new research result: A significance test for the lasso

7 0.77472073 690 andrew gelman stats-2011-05-01-Peter Huber’s reflections on data analysis

8 0.77400768 1763 andrew gelman stats-2013-03-14-Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses

9 0.76754904 1418 andrew gelman stats-2012-07-16-Long discussion about causal inference and the use of hierarchical models to bridge between different inferential settings

10 0.76158798 1292 andrew gelman stats-2012-05-01-Colorless green facts asserted resolutely

11 0.75485229 1156 andrew gelman stats-2012-02-06-Bayesian model-building by pure thought: Some principles and examples

12 0.75000352 2176 andrew gelman stats-2014-01-19-Transformations for non-normal data

13 0.74740624 789 andrew gelman stats-2011-07-07-Descriptive statistics, causal inference, and story time

14 0.74688798 1742 andrew gelman stats-2013-02-27-What is “explanation”?

15 0.74684078 1469 andrew gelman stats-2012-08-25-Ways of knowing

16 0.74680918 101 andrew gelman stats-2010-06-20-“People with an itch to scratch”

17 0.74529797 10 andrew gelman stats-2010-04-29-Alternatives to regression for social science predictions

18 0.74328983 524 andrew gelman stats-2011-01-19-Data exploration and multiple comparisons

19 0.74295413 1529 andrew gelman stats-2012-10-11-Bayesian brains?

20 0.74254191 1740 andrew gelman stats-2013-02-26-“Is machine learning a subset of statistics?”


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(15, 0.011), (16, 0.07), (24, 0.122), (37, 0.014), (42, 0.013), (61, 0.012), (81, 0.012), (86, 0.375), (87, 0.012), (99, 0.264)]
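The (topicId, topicWeight) pairs above are a document's LDA topic distribution; weights below a display threshold are omitted, so the listed entries need not sum to 1. The page does not say how the lda simValues below are computed, but a common choice for comparing two topic distributions is Hellinger distance. A sketch under that assumption (the second document and the uniform fill-in for omitted topics are illustrative):

```python
import math

# Sparse topic distributions in the format shown above, with
# below-threshold topics omitted. doc_b is made up for illustration.
doc_a = {15: 0.011, 16: 0.07, 24: 0.122, 86: 0.375, 99: 0.264}
doc_b = {16: 0.05, 24: 0.30, 86: 0.40, 99: 0.20}

def dense(sparse, n_topics=100):
    """Spread the unlisted probability mass uniformly over omitted topics."""
    rest = max(0.0, 1.0 - sum(sparse.values())) / (n_topics - len(sparse))
    return [sparse.get(t, rest) for t in range(n_topics)]

def hellinger(p, q):
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

print(hellinger(dense(doc_a), dense(doc_b)))
```

Small distances (or, equivalently, high similarities) indicate posts whose topic mass concentrates on the same topic ids, such as topics 86 and 99 here.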

similar blogs list:

simIndex simValue blogId blogTitle

1 0.9844051 1427 andrew gelman stats-2012-07-24-More from the sister blog

Introduction: Anthropologist Bruce Mannheim reports that a recent well-publicized study on the genetics of native Americans, which used genetic analysis to find “at least three streams of Asian gene flow,” is in fact a confirmation of a long-known fact. Mannheim writes: This three-way distinction was known linguistically since the 1920s (for example, Sapir 1921). Basically, it’s a division among the Eskimo-Aleut languages, which straddle the Bering Straits even today, the Athabaskan languages (which were discovered to be related to a small Siberian language family only within the last few years, not by Greenberg as Wade suggested), and everything else. This is not to say that the results from genetics are unimportant, but it’s good to see how it fits with other aspects of our understanding.

2 0.97320557 1530 andrew gelman stats-2012-10-11-Migrating your blog from Movable Type to WordPress

Introduction: Cord Blomquist, who did a great job moving us from horrible Movable Type to nice nice WordPress, writes: I [Cord] wanted to share a little news with you related to the original work we did for you last year. When ReadyMadeWeb converted your Movable Type blog to WordPress, we got a lot of other requests for the same service, so we started thinking about a bigger market for such a product. After a bit of research, we started work on automating the data conversion, writing rules, and exceptions to the rules, on how Movable Type and TypePad data could be translated to WordPress. After many months of work, we’re getting ready to announce TP2WP.com , a service that converts Movable Type and TypePad export files to WordPress import files, so anyone who wants to migrate to WordPress can do so easily and without losing permalinks, comments, images, or other files. By automating our service, we’ve been able to drop the price to just $99. I recommend it (and, no, Cord is not paying m

3 0.95483756 873 andrew gelman stats-2011-08-26-Luck or knowledge?

Introduction: Joan Ginther has won the Texas lottery four times. First, she won $5.4 million, then a decade later, she won $2 million, then two years later $3 million, and in the summer of 2010, she hit a $10 million jackpot. The odds of this have been calculated at one in eighteen septillion, and luck like this could only come once every quadrillion years. According to Forbes, the residents of Bishop, Texas, seem to believe God was behind it all. The Texas Lottery Commission told Mr Rich that Ms Ginther must have been ‘born under a lucky star’, and that they don’t suspect foul play. Harper’s reporter Nathanial Rich recently wrote an article about Ms Ginther, which calls the validity of her ‘luck’ into question. First, he points out, Ms Ginther is a former math professor with a PhD from Stanford University specialising in statistics. More at Daily Mail. [Edited Saturday] In comments, C Ryan King points to the original article at Harper’s and Bill Jefferys to Wired .

4 0.95273256 558 andrew gelman stats-2011-02-05-Fattening of the world and good use of the alpha channel

Introduction: In the spirit of Gapminder , Washington Post created an interactive scatterplot viewer that’s using alpha channel to tell apart overlapping fat dots better than sorting-by-circle-size Gapminder is using: Good news: the rate of fattening of the USA appears to be slowing down. Maybe because of high gas prices? But what’s happening with Oceania?

5 0.9401412 253 andrew gelman stats-2010-09-03-Gladwell vs Pinker

Introduction: I just happened to notice this from last year. Eric Loken writes : Steven Pinker reviewed Malcolm Gladwell’s latest book and criticized him rather harshly for several shortcomings. Gladwell appears to have made things worse for himself in a letter to the editor of the NYT by defending a manifestly weak claim from one of his essays – the claim that NFL quarterback performance is unrelated to the order they were drafted out of college. The reason we [Loken and his colleagues] are implicated is that Pinker identified an earlier blog post of ours as one of three sources he used to challenge Gladwell (yay us!). But Gladwell either misrepresented or misunderstood our post in his response, and admonishes Pinker by saying “we should agree that our differences owe less to what can be found in the scientific literature than they do to what can be found on Google.” Well, here’s what you can find on Google. Follow this link to request the data for NFL quarterbacks drafted between 1980 and

same-blog 6 0.93098778 1718 andrew gelman stats-2013-02-11-Toward a framework for automatic model building

7 0.92867345 904 andrew gelman stats-2011-09-13-My wikipedia edit

8 0.92690408 76 andrew gelman stats-2010-06-09-Both R and Stata

9 0.92255467 2219 andrew gelman stats-2014-02-21-The world’s most popular languages that the Mac documentation hasn’t been translated into

10 0.91410559 436 andrew gelman stats-2010-11-29-Quality control problems at the New York Times

11 0.89425886 1547 andrew gelman stats-2012-10-25-College football, voting, and the law of large numbers

12 0.88830519 1552 andrew gelman stats-2012-10-29-“Communication is a central task of statistics, and ideally a state-of-the-art data analysis can have state-of-the-art displays to match”

13 0.87720442 1327 andrew gelman stats-2012-05-18-Comments on “A Bayesian approach to complex clinical diagnoses: a case-study in child abuse”

14 0.86204016 759 andrew gelman stats-2011-06-11-“2 level logit with 2 REs & large sample. computational nightmare – please help”

15 0.85552931 305 andrew gelman stats-2010-09-29-Decision science vs. social psychology

16 0.8540355 2082 andrew gelman stats-2013-10-30-Berri Gladwell Loken football update

17 0.85171109 276 andrew gelman stats-2010-09-14-Don’t look at just one poll number–unless you really know what you’re doing!

18 0.82471347 1971 andrew gelman stats-2013-08-07-I doubt they cheated

19 0.82370728 1586 andrew gelman stats-2012-11-21-Readings for a two-week segment on Bayesian modeling?

20 0.82320464 1278 andrew gelman stats-2012-04-23-“Any old map will do” meets “God is in every leaf of every tree”