Introduction: Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly noisy parameter estimates that don’t perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data. This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins. In any case, the trick is to stop at the right point–not so soon that you’re ignoring the data but not so late that you en

1 Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly noisy parameter estimates that don’t perform so well when predicting new data. [sent-1, score-0.309]

2 A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. [sent-2, score-1.143]

3 In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. [sent-3, score-0.96]

4 If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data. [sent-4, score-0.908]

5 This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. [sent-5, score-0.31]

6 That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins. [sent-6, score-0.37]
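To make the "start with the data and move it toward a model" idea concrete, here is a minimal sketch of iterative proportional fitting for a two-way table. The function name `ipf`, the example counts, and the target margins are all made up for illustration; this is not code from Deming and Stephan or from the post. The `n_iter` argument is where an early stop would come in: a small value leaves the table closer to the raw data, a large value pushes it all the way to the margins.

```python
def ipf(table, row_targets, col_targets, n_iter=200, tol=1e-10):
    """Iterative proportional fitting for a two-way table (illustrative sketch).

    Assumes all cells are positive and both sets of targets share one total.
    """
    fitted = [row[:] for row in table]  # copy the observed table
    for _ in range(n_iter):
        # Rescale each row so it sums to its target margin.
        for i, target in enumerate(row_targets):
            s = sum(fitted[i])
            fitted[i] = [x * target / s for x in fitted[i]]
        # Rescale each column so it sums to its target margin.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= target / s
        # Columns now match exactly; stop once the rows match too.
        err = max(abs(sum(fitted[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return fitted


observed = [[40.0, 30.0], [35.0, 45.0]]  # hypothetical sample counts
fit = ipf(observed, row_targets=[60.0, 90.0], col_targets=[80.0, 70.0])
row_sums = [sum(r) for r in fit]
col_sums = [sum(r[j] for r in fit) for j in range(2)]
# IPF only rescales rows and columns, so the odds ratio of the data survives.
odds_ratio = (fit[0][0] * fit[1][1]) / (fit[0][1] * fit[1][0])
```

The fitted table matches the known margins while keeping the association structure (here, the odds ratio) of the observed data, which is the sense in which the procedure respects both the data and the model.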

7 In any case, the trick is to stop at the right point–not so soon that you’re ignoring the data but not so late that you end up with something too noisy. [sent-7, score-0.357]

8 Here’s an example of what you might want: [figure not reproduced here]. The trouble is, you don’t actually know the true value, so you can’t directly use this sort of plot to make a stopping decision. [sent-8, score-0.355]

9 Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. [sent-9, score-2.287]

10 By stopping early, you’re compromising between the prior and the data estimates. [sent-10, score-0.485]
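A toy case where the correspondence can be checked by hand: estimating a single normal mean with unit variance by gradient ascent. In this one-parameter model the match between early stopping and a penalized estimate is exact; in general it is only approximate, as stated above. The data, step size, and function names below are assumptions chosen for illustration.

```python
def early_stopped_mean(y, start, step, n_steps):
    """Gradient ascent on the unit-variance normal log likelihood for a mean."""
    theta = start
    for _ in range(n_steps):
        grad = sum(yi - theta for yi in y)  # derivative of -0.5 * sum (y_i - theta)^2
        theta += step * grad
    return theta


def penalized_mean(y, start, lam):
    """Posterior mode under a normal prior centered at `start` with precision `lam`."""
    n = len(y)
    return (sum(y) + lam * start) / (n + lam)


y = [2.1, 1.7, 2.6, 1.9, 2.3]  # made-up data
start, step, n_steps = 0.0, 0.02, 10
theta_es = early_stopped_mean(y, start, step, n_steps)

n = len(y)
w = (1 - step * n) ** n_steps  # weight still on the starting point after n_steps
lam = n * w / (1 - w)          # prior precision that gives the same weight
theta_pen = penalized_mean(y, start, lam)
y_bar = sum(y) / n
```

Each step shrinks the gap to the sample mean by a factor `1 - step * n`, so after `n_steps` iterations the estimate is a weighted average of the starting point and the sample mean. Solving for the prior precision that produces the same weights gives `lam`. Fewer steps mean a larger implied `lam`, that is, a stronger prior.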

11 Early stopping is a simple implementation but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function. [sent-11, score-0.758]

12 For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model is below some preset level such as its unconditional expected value. [sent-14, score-1.003]
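A rough sketch of such a stopping rule, under assumptions that are mine rather than the post's: one parameter per observation (as in an image reconstruction), a known noise standard deviation `sigma`, a flat starting point at the overall mean, and a threshold equal to the number of observations, which is the expected value of a chi-squared discrepancy with that many terms. The data and the function name `fit_until_chi2` are hypothetical.

```python
def fit_until_chi2(y, sigma, step=0.1, max_steps=1000):
    """Gradient ascent from a flat start, stopped when chi-squared <= len(y)."""
    n = len(y)
    start = sum(y) / n   # flat starting point: every fitted value equals the mean
    theta = [start] * n

    def chi2(fit):
        return sum(((yi - fi) / sigma) ** 2 for yi, fi in zip(y, fit))

    steps = 0
    # Left to run, the fit would reach theta == y (chi-squared 0, pure overfit).
    while chi2(theta) > n and steps < max_steps:
        theta = [ti + step * (yi - ti) / sigma ** 2 for yi, ti in zip(y, theta)]
        steps += 1
    return theta, steps, chi2(theta), start


y = [0.3, 2.9, -1.2, 4.1, 1.8, -0.7, 3.3, 0.9]  # made-up noisy measurements
theta, steps, final_chi2, start = fit_until_chi2(y, sigma=1.0)
```

The fitted values end up partway between the flat start and the raw data, with the amount of smoothing set by the data through the discrepancy threshold rather than by a penalty weight fixed in advance.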

13 Such a procedure, however, is not the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter. [sent-16, score-0.694]

14 I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). [sent-17, score-0.644]

15 The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. [sent-18, score-0.576]

16 In 1986 or 1987 I was browsing the statistics section of the once-great Stanford bookstore and noticed (and bought) an interesting-looking book on maximum entropy and Bayesian methods. [sent-22, score-0.786]

17 Some months later I saw an announcement of a maximum entropy conference to be held in Cambridge, England, in the summer of 1988. [sent-24, score-0.798]

18 The participants at the conference were very pleasant. [sent-29, score-0.281]

19 The other thing I remember from the conference was that many of the participants were politically conservative. [sent-19, score-0.339]

20 Somebody else told me that the organizers of the conference were extreme Thatcherites. [sent-38, score-0.396]

