andrew_gelman_stats andrew_gelman_stats-2011 andrew_gelman_stats-2011-788 knowledge-graph by maker-knowledge-mining

788 andrew gelman stats-2011-07-06-Early stopping and penalized likelihood


meta info for this blog

Source: html

Introduction: Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly-noisy parameter estimates that don’t perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data. This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins. In any case, the trick is to stop at the right point–not so soon that you’re ignoring the data but not so late that you end up with something too noisy.
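
A minimal sketch of why this works, using the standard quadratic approximation (my shorthand, not notation from the post): write the negative log-likelihood near the maximum-likelihood estimate theta_hat as f(theta) = (1/2)(theta - theta_hat)' H (theta - theta_hat) with H positive definite. Gradient descent with step size alpha, started at theta_0, gives theta_k = theta_hat + (I - alpha*H)^k (theta_0 - theta_hat), while the penalized-likelihood (posterior mode) estimate under a normal prior centered at theta_0 with prior variance tau is theta_0 + (H + I/tau)^(-1) H (theta_hat - theta_0). In the eigenbasis of H the two estimates shrink the MLE toward theta_0 by the factors 1 - (1 - alpha*h)^k and h/(h + 1/tau) respectively, so taking more iterations plays the same role as weakening the prior, with roughly tau ≈ k*alpha when alpha*h is small.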


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly-noisy parameter estimates that don’t perform so well when predicting new data. [sent-1, score-0.309]

2 A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. [sent-2, score-1.143]

3 In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. [sent-3, score-0.96]

4 If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data. [sent-4, score-0.908]

5 This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. [sent-5, score-0.31]

6 That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins. [sent-6, score-0.37]
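
A minimal sketch of that algorithm (my own toy example, not code from the post): iterative proportional fitting alternately rescales the rows and columns of a starting table until its margins match known totals.

import numpy as np

def ipf(seed, row_totals, col_totals, n_sweeps=100, tol=1e-10):
    # Deming-Stephan style iterative proportional fitting for a two-way table.
    fit = seed.astype(float).copy()
    for _ in range(n_sweeps):
        fit *= (row_totals / fit.sum(axis=1))[:, None]   # rescale rows to the target row margins
        fit *= (col_totals / fit.sum(axis=0))[None, :]   # rescale columns to the target column margins
        if np.allclose(fit.sum(axis=1), row_totals, atol=tol):
            break                                        # row margins survived the column step: done
    return fit

seed = np.array([[1.0, 2.0], [3.0, 4.0]])                # hypothetical starting table
fit = ipf(seed, row_totals=np.array([30.0, 70.0]),
          col_totals=np.array([40.0, 60.0]))
print(fit.sum(axis=1), fit.sum(axis=0))                  # both sets of margins now match the targets

With the observed table as the seed, stopping after only a sweep or two leaves the fit partway between the data and the known margins, which is the other-direction version of the trick described above.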

7 In any case, the trick is to stop at the right point–not so soon that you’re ignoring the data but not so late that you end up with something too noisy. [sent-7, score-0.357]

8 Here’s an example of what you might want [the original post shows a plot at this point]: The trouble is, you don’t actually know the true value so you can’t directly use this sort of plot to make a stopping decision. [sent-8, score-0.355]

9 Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. [sent-9, score-2.287]
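
A minimal numerical sketch of that equivalence on a made-up ridge-regression example (my own illustration; the mapping from iteration count to penalty below is only a rough correspondence, not a result from the post):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=2.0, size=n)

# Early stopping: gradient descent on the least-squares objective, started at zero,
# so the implicit prior is centered at beta = 0.
alpha = 1.0 / np.linalg.eigvalsh(X.T @ X).max()          # conservative step size
beta, path = np.zeros(p), []
for k in range(200):
    beta = beta - alpha * X.T @ (X @ beta - y)           # gradient step on 0.5*||y - X beta||^2
    path.append(beta.copy())

# Penalized likelihood: the posterior mode under a normal prior centered at zero
# is the ridge estimate (X'X + lam*I)^(-1) X'y.
def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Compare iterate k with the ridge fit at penalty roughly 1/(alpha*(k+1));
# the two shrink toward zero together, though they are not identical.
for k in (5, 20, 100):
    print(k, np.round(np.linalg.norm(path[k] - ridge(1.0 / (alpha * (k + 1)))), 3))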

10 By stopping early, you’re compromising between the prior and the data estimates. [sent-10, score-0.485]

11 Early stopping is a simple implementation but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function. [sent-11, score-0.758]

12 For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model is below some preset level such as its unconditional expected value. [sent-14, score-1.003]
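
A minimal sketch of such a rule (my own toy Poisson-regression example, not code from the post; with some simulated datasets the threshold is never reached and the loop simply runs to its cap):

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(scale=0.3, size=(n, p))
y = rng.poisson(np.exp(X @ rng.normal(size=p)))

# Iterative maximum likelihood for Poisson regression by plain gradient ascent,
# stopped once the Pearson chi-squared discrepancy drops below a preset level,
# here n, a stand-in for its rough unconditional expected value under the model.
beta, step = np.zeros(p), 0.005
for k in range(10000):
    mu = np.exp(X @ beta)
    chi_sq = np.sum((y - mu) ** 2 / mu)                  # Pearson discrepancy between data and fit
    if chi_sq < n:
        break                                            # stop: the fit is already as close as expected
    beta = beta + step * X.T @ (y - mu)                  # gradient of the Poisson log-likelihood
print("stopped at iteration", k, "with chi-squared", round(chi_sq, 1))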

13 Such a procedure, however, is not the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter. [sent-16, score-0.694]

14 I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). [sent-17, score-0.644]

15 The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. [sent-18, score-0.576]

16 In 1986 or 1987 I was browsing the statistics section of the once-great Stanford bookstore and noticed (and bought) an interesting-looking book on maximum entropy and Bayesian methods. [sent-22, score-0.786]

17 Some months later I saw an announcement of a maximum entropy conference to be held in Cambridge, England, in the summer of 1988. [sent-24, score-0.798]

18 The participants at the conference were very pleasant. [sent-29, score-0.281]

19 The other thing I remember from the conference was that many of the participants were politically conservative. [sent-36, score-0.339]

20 Somebody else told me that the organizers of the conference were extreme Thatcherites. [sent-38, score-0.396]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('maximum', 0.401), ('stopping', 0.355), ('conference', 0.212), ('iterative', 0.2), ('likelihood', 0.187), ('entropy', 0.185), ('starting', 0.166), ('jaynes', 0.159), ('early', 0.154), ('stop', 0.138), ('algorithm', 0.112), ('penalized', 0.106), ('hal', 0.103), ('advice', 0.091), ('mode', 0.09), ('appealing', 0.088), ('trick', 0.086), ('went', 0.083), ('objective', 0.081), ('late', 0.075), ('prior', 0.072), ('browsing', 0.071), ('preset', 0.071), ('talk', 0.071), ('fields', 0.07), ('participants', 0.069), ('extreme', 0.069), ('stephan', 0.067), ('naval', 0.067), ('bookstore', 0.067), ('photographs', 0.067), ('respecting', 0.067), ('came', 0.067), ('primaries', 0.064), ('deming', 0.064), ('yielding', 0.064), ('section', 0.062), ('politics', 0.062), ('optimizing', 0.062), ('discrepancy', 0.06), ('civilian', 0.06), ('varieties', 0.06), ('reconstruction', 0.058), ('organizers', 0.058), ('margins', 0.058), ('sessions', 0.058), ('posterior', 0.058), ('remember', 0.058), ('data', 0.058), ('told', 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000004 788 andrew gelman stats-2011-07-06-Early stopping and penalized likelihood


2 0.24453118 2210 andrew gelman stats-2014-02-13-Stopping rules and Bayesian analysis

Introduction: I happened to receive two questions about stopping rules on the same day. First, from Tom Cunningham: I’ve been arguing with my colleagues about whether the stopping rule is relevant (a presenter disclosed that he went out to collect more data because the first experiment didn’t get significant results) — and I believe you have some qualifications to the Bayesian irrelevance argument but I don’t properly understand them. Then, from Benjamin Kay: I have a question that may be of interest for your blog. I was reading about the early history of AIDS and learned that the trial of AZT was ended early because it was so effective: The trial reported in the New England Journal of Medicine had produced a dramatic result. Before the planned 24 week duration of the study, after a mean period of participation of about 120 days, nineteen participants receiving placebo had died while there was only a single death among those receiving AZT. This appeared to be a momentous break

3 0.22726306 442 andrew gelman stats-2010-12-01-bayesglm in Stata?

Introduction: Is there an implementation of bayesglm in Stata? (That is, approximate maximum penalized likelihood estimation with specified normal or t prior distributions on the coefficients.)

4 0.20592004 779 andrew gelman stats-2011-06-25-Avoiding boundary estimates using a prior distribution as regularization

Introduction: For a while I’ve been fitting most of my multilevel models using lmer/glmer, which gives point estimates of the group-level variance parameters (maximum marginal likelihood estimate for lmer and an approximation for glmer). I’m usually satisfied with this–sure, point estimation understates the uncertainty in model fitting, but that’s typically the least of our worries. Sometimes, though, lmer/glmer estimates group-level variances at 0 or estimates group-level correlation parameters at +/- 1. Typically, when this happens, it’s not that we’re so sure the variance is close to zero or that the correlation is close to 1 or -1; rather, the marginal likelihood does not provide a lot of information about these parameters of the group-level error distribution. I don’t want point estimates on the boundary. I don’t want to say that the unexplained variance in some dimension is exactly zero. One way to handle this problem is full Bayes: slap a prior on sigma, do your Gibbs and Metropolis
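
A minimal sketch of the boundary problem and of a prior-as-regularization fix (my own constructed toy with a balanced one-way model and known sampling error; the gamma-type log-prior below is one boundary-avoiding choice, not necessarily the one meant in the post):

import numpy as np

# Observed group means, made up so that their spread is smaller than their sampling
# noise alone would produce; this is exactly when maximum marginal likelihood puts
# the group-level sd at the boundary, 0.
ybar = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.05, 0.2, -0.15])
se = 0.45                                                # known sampling sd of each group mean

def profile_loglik(sigma_a):
    v = sigma_a**2 + se**2                               # marginal variance of each group mean
    r = ybar - ybar.mean()                               # overall mean profiled out
    return -0.5 * (len(ybar) * np.log(v) + np.sum(r**2) / v)

sigmas = np.linspace(1e-6, 1.0, 2000)
ml = sigmas[np.argmax([profile_loglik(s) for s in sigmas])]

# Add a weak gamma(2, rate = 1/A) log-prior on sigma_a: log(sigma_a) - sigma_a/A,
# which goes to -inf at zero, so the penalized mode cannot sit on the boundary.
A = 1.0
penalized = sigmas[np.argmax([profile_loglik(s) + np.log(s) - s / A for s in sigmas])]
print(round(ml, 3), round(penalized, 3))                 # ml sits at the edge of the grid (the boundary); penalized does not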

5 0.20448323 1560 andrew gelman stats-2012-11-03-Statistical methods that work in some settings but not others

Introduction: David Hogg pointed me to this post by Larry Wasserman: 1. The Horwitz-Thompson estimator    satisfies the following condition: for every   , where   — the parameter space — is the set of all functions  . (There are practical improvements to the Horwitz-Thompson estimator that we discussed in our earlier posts but we won’t revisit those here.) 2. A Bayes estimator requires a prior   for  . In general, if   is not a function of   then (1) will not hold. . . . 3. If you let   be a function of  , (1) still, in general, does not hold. 4. If you make   a function of   in just the right way, then (1) will hold. . . . There is nothing wrong with doing this, but in our opinion this is not in the spirit of Bayesian inference. . . . 7. This example is only meant to show that Bayesian estimators do not necessarily have good frequentist properties. This should not be surprising. There is no reason why we should in general expect a Bayesian method to have a frequentist property

6 0.16167679 247 andrew gelman stats-2010-09-01-How does Bayes do it?

7 0.14520605 774 andrew gelman stats-2011-06-20-The pervasive twoishness of statistics; in particular, the “sampling distribution” and the “likelihood” are two different models, and that’s a good thing

8 0.12490034 72 andrew gelman stats-2010-06-07-Valencia: Summer of 1991

9 0.12265727 1247 andrew gelman stats-2012-04-05-More philosophy of Bayes

10 0.12150051 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

11 0.11758319 519 andrew gelman stats-2011-01-16-Update on the generalized method of moments

12 0.11068365 846 andrew gelman stats-2011-08-09-Default priors update?

13 0.10682081 398 andrew gelman stats-2010-11-06-Quote of the day

14 0.10228598 2200 andrew gelman stats-2014-02-05-Prior distribution for a predicted probability

15 0.099805333 2007 andrew gelman stats-2013-09-03-Popper and Jaynes

16 0.098985411 1046 andrew gelman stats-2011-12-07-Neutral noninformative and informative conjugate beta and gamma prior distributions

17 0.098443672 350 andrew gelman stats-2010-10-18-Subtle statistical issues to be debated on TV.

18 0.097612791 2201 andrew gelman stats-2014-02-06-Bootstrap averaging: Examples where it works and where it doesn’t work

19 0.097224616 1149 andrew gelman stats-2012-02-01-Philosophy of Bayesian statistics: my reactions to Cox and Mayo

20 0.097142577 109 andrew gelman stats-2010-06-25-Classics of statistics


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.213), (1, 0.06), (2, -0.027), (3, 0.054), (4, -0.017), (5, -0.001), (6, 0.051), (7, -0.003), (8, -0.049), (9, -0.034), (10, 0.008), (11, -0.008), (12, 0.028), (13, -0.014), (14, -0.008), (15, -0.041), (16, -0.007), (17, -0.029), (18, 0.033), (19, -0.013), (20, 0.004), (21, -0.035), (22, 0.043), (23, -0.003), (24, -0.002), (25, 0.012), (26, -0.029), (27, -0.046), (28, 0.058), (29, -0.003), (30, -0.038), (31, 0.026), (32, 0.01), (33, -0.004), (34, 0.035), (35, -0.037), (36, -0.023), (37, -0.002), (38, 0.006), (39, -0.002), (40, 0.038), (41, 0.033), (42, -0.003), (43, 0.032), (44, 0.054), (45, -0.002), (46, 0.022), (47, 0.004), (48, 0.005), (49, 0.046)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95819271 788 andrew gelman stats-2011-07-06-Early stopping and penalized likelihood


2 0.78900886 1769 andrew gelman stats-2013-03-18-Tibshirani announces new research result: A significance test for the lasso

Introduction: Lasso and me For a long time I was wrong about lasso. Lasso (“least absolute shrinkage and selection operator”) is a regularization procedure that shrinks regression coefficients toward zero, and in its basic form is equivalent to maximum penalized likelihood estimation with a penalty function that is proportional to the sum of the absolute values of the regression coefficients. I first heard about lasso from a talk that Trevor Hastie or Rob Tibshirani gave at Berkeley in 1994 or 1995. He demonstrated that it shrunk regression coefficients to zero. I wasn’t impressed, first because it seemed like no big deal (if that’s the prior you use, that’s the shrinkage you get) and second because, from a Bayesian perspective, I don’t want to shrink things all the way to zero. In the sorts of social and environmental science problems I’ve worked on, just about nothing is zero. I’d like to control my noisy estimates but there’s nothing special about zero. At the end of the talk I stood
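
A minimal sketch of that penalty (my own illustration, assuming an orthonormal design so the lasso solution has a closed form; general problems need coordinate descent or similar):

import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 8
X, _ = np.linalg.qr(rng.normal(size=(n, p)))             # orthonormal columns, so X'X = I
beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_ls = X.T @ y                                         # least-squares estimate (since X'X = I)
lam = 0.8                                                 # weight on the sum of absolute coefficients
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)   # soft-thresholding

print(np.round(beta_ls, 2))                               # noisy; nothing exactly zero
print(np.round(beta_lasso, 2))                            # small coefficients typically set exactly to zero

The soft-thresholding step is what sends small coefficients exactly to zero, which is the shrinkage-to-zero behavior being discussed.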

3 0.73892379 1955 andrew gelman stats-2013-07-25-Bayes-respecting experimental design and other things

Introduction: Dan Lakeland writes: I have some questions about some basic statistical ideas and would like your opinion on them: 1) Parameters that manifestly DON’T exist: It makes good sense to me to think about Bayesian statistics as narrowing in on the value of parameters based on a model and some data. But there are cases where “the parameter” simply doesn’t make sense as an actual thing. Yet, it’s not really a complete fiction, like unicorns either, it’s some kind of “effective” thing maybe. Here’s an example of what I mean. I did a simple toy experiment where we dropped crumpled up balls of paper and timed their fall times. (see here: http://models.street-artists.org/?s=falling+ball ) It was pretty instructive actually, and I did it to figure out how to in a practical way use an ODE to get a likelihood in MCMC procedures. One of the parameters in the model is the radius of the spherical ball of paper. But the ball of paper isn’t a sphere, not even approximately. There’s no single valu

4 0.7297172 1422 andrew gelman stats-2012-07-20-Likelihood thresholds and decisions

Introduction: David Hogg points me to this discussion: Martin Strasbourg and I [Hogg] discussed his project to detect new satellites of M31 in the PAndAS survey. He can construct a likelihood ratio (possibly even a marginalized likelihood ratio) at every position in the M31 imaging, between the best-fit satellite-plus-background model and the best nothing-plus-background model. He can make a two-dimensional map of these likelihood ratios and show the histogram of them. Looking at this histogram, which has a tail to very large ratios, he asked me, where should I put my cut? That is, at what likelihood ratio does a candidate deserve follow-up? Here’s my unsatisfying answer: To a statistician, the distribution of likelihood ratios is interesting and valuable to study. To an astronomer, it is uninteresting. You don’t want to know the distribution of likelihoods, you want to find satellites . . . I wrote that I think this makes sense and that it would actually be an interesting and useful rese

5 0.71759337 519 andrew gelman stats-2011-01-16-Update on the generalized method of moments

Introduction: After reading all the comments here I remembered that I’ve actually written a paper on the generalized method of moments–including the bit about maximum likelihood being a special case. The basic idea is simple enough that it must have been rediscovered dozens of times by different people (sort of like the trapezoidal rule). In our case, we were motivated to (independently) develop the (well-known, but not by me) generalized method of moments as a way of specifying an indirectly-parameterized prior distribution, rather than as a way of estimating parameters from direct data. But the math is the same.

6 0.70942605 1520 andrew gelman stats-2012-10-03-Advice that’s so eminently sensible but so difficult to follow

7 0.70786858 1560 andrew gelman stats-2012-11-03-Statistical methods that work in some settings but not others

8 0.70565903 833 andrew gelman stats-2011-07-31-Untunable Metropolis

9 0.70251679 776 andrew gelman stats-2011-06-22-Deviance, DIC, AIC, cross-validation, etc

10 0.69500804 1309 andrew gelman stats-2012-05-09-The first version of my “inference from iterative simulation using parallel sequences” paper!

11 0.69292039 984 andrew gelman stats-2011-11-01-David MacKay sez . . . 12??

12 0.69115901 507 andrew gelman stats-2011-01-07-Small world: MIT, asymptotic behavior of differential-difference equations, Susan Assmann, subgroup analysis, multilevel modeling

13 0.68082184 2311 andrew gelman stats-2014-04-29-Bayesian Uncertainty Quantification for Differential Equations!

14 0.68037742 1673 andrew gelman stats-2013-01-15-My talk last night at the visualization meetup

15 0.68003535 1518 andrew gelman stats-2012-10-02-Fighting a losing battle

16 0.67936754 2208 andrew gelman stats-2014-02-12-How to think about “identifiability” in Bayesian inference?

17 0.67840832 1941 andrew gelman stats-2013-07-16-Priors

18 0.6776759 1157 andrew gelman stats-2012-02-07-Philosophy of Bayesian statistics: my reactions to Hendry

19 0.67575777 669 andrew gelman stats-2011-04-19-The mysterious Gamma (1.4, 0.4)

20 0.67548001 1750 andrew gelman stats-2013-03-05-Watership Down, thick description, applied statistics, immutability of stories, and playing tennis with a net


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.011), (3, 0.014), (5, 0.017), (15, 0.023), (16, 0.084), (21, 0.023), (24, 0.185), (34, 0.023), (44, 0.073), (45, 0.022), (55, 0.02), (59, 0.011), (63, 0.01), (77, 0.021), (85, 0.012), (86, 0.035), (89, 0.012), (96, 0.017), (99, 0.248)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97685939 693 andrew gelman stats-2011-05-04-Don’t any statisticians work for the IRS?

Introduction: A friend asks the above question and writes: This article left me thinking – how could the IRS not notice that this guy didn’t file taxes for several years? Don’t they run checks and notice if you miss a year? If I write a check out of order, there’s an asterisk next to the check number in my next bank statement showing that there was a gap in the sequence. If you ran the IRS, wouldn’t you do this: SSNs are issued sequentially. Once a SSN reaches 18, expect it to file a return. If it doesn’t, mail out a postage paid letter asking why not with check boxes such as Student, Unemployed, etc. Follow up at reasonable intervals. Eventually every SSN should be filing a return, or have an international address. Yes this is intrusive, but my goal is only to maximize tax revenue. Surely people who do this for a living could come up with something more elegant. My response: I dunno, maybe some confidentiality rules? The other thing is that I’m guessing that IRS gets lots of pushback w

same-blog 2 0.97239602 788 andrew gelman stats-2011-07-06-Early stopping and penalized likelihood


3 0.95889282 1627 andrew gelman stats-2012-12-17-Stan and RStan 1.1.0

Introduction: We’re happy to announce the availability of Stan and RStan versions 1.1.0, which are general tools for performing model-based Bayesian inference using the no-U-turn sampler, an adaptive form of Hamiltonian Monte Carlo. Information on downloading and installing and using them is available as always from Stan Home Page: http://mc-stan.org/ Let us know if you have any problems on the mailing lists or at the e-mails linked on the home page (please don’t use this web page). The full release notes follow. (R)Stan Version 1.1.0 Release Notes =================================== -- Backward Compatibility Issue * Categorical distribution recoded to match documentation; it now has support {1,...,K} rather than {0,...,K-1}. * (RStan) change default value of permuted flag from FALSE to TRUE for Stan fit S4 extract() method -- New Features * Conditional (if-then-else) statements * While statements -- New Functions * generalized multiply_lower_tri

4 0.95596588 2089 andrew gelman stats-2013-11-04-Shlemiel the Software Developer and Unknown Unknowns

Introduction: The Stan meeting today reminded me of Joel Spolsky’s recasting of the Yiddish joke about Shlemiel the Painter. Joel retold it on his blog, Joel on Software , in the post Back to Basics : Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck. The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck. The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?” “I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!” Joel used it as an example of the kind of string processing naive programmers ar

5 0.95404005 1713 andrew gelman stats-2013-02-08-P-values and statistical practice

Introduction: From my new article in the journal Epidemiology: Sander Greenland and Charles Poole accept that P values are here to stay but recognize that some of their most common interpretations have problems. The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). A Bayesian interpretation based on a spike-and-slab model makes little sense in applied contexts in epidemiology, political science, and other fields in which true effects are typically nonzero and bounded (thus violating both the “spike” and the “slab” parts of the model). I find Greenland and Poole’s perspective t

6 0.95351231 2161 andrew gelman stats-2014-01-07-My recent debugging experience

7 0.95346498 807 andrew gelman stats-2011-07-17-Macro causality

8 0.95315862 1176 andrew gelman stats-2012-02-19-Standardized writing styles and standardized graphing styles

9 0.9522661 1367 andrew gelman stats-2012-06-05-Question 26 of my final exam for Design and Analysis of Sample Surveys

10 0.95224541 2086 andrew gelman stats-2013-11-03-How best to compare effects measured in two different time periods?

11 0.95169204 1240 andrew gelman stats-2012-04-02-Blogads update

12 0.95137405 2149 andrew gelman stats-2013-12-26-Statistical evidence for revised standards

13 0.95110285 1117 andrew gelman stats-2012-01-13-What are the important issues in ethics and statistics? I’m looking for your input!

14 0.9507603 1206 andrew gelman stats-2012-03-10-95% intervals that I don’t believe, because they’re from a flat prior I don’t believe

15 0.95068228 2305 andrew gelman stats-2014-04-25-Revised statistical standards for evidence (comments to Val Johnson’s comments on our comments on Val’s comments on p-values)

16 0.95058423 2150 andrew gelman stats-2013-12-27-(R-Py-Cmd)Stan 2.1.0

17 0.95052946 2121 andrew gelman stats-2013-12-02-Should personal genetic testing be regulated? Battle of the blogroll

18 0.95038855 2112 andrew gelman stats-2013-11-25-An interesting but flawed attempt to apply general forecasting principles to contextualize attitudes toward risks of global warming

19 0.94951987 898 andrew gelman stats-2011-09-10-Fourteen magic words: an update

20 0.94939876 1879 andrew gelman stats-2013-06-01-Benford’s law and addresses