
185 hunch net-2006-06-16-Regularization = Robustness


meta information for this blog

Source: html

Introduction: The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math, argmax_p H(p) s.t. E_p[f_i] = c_i is given by e^{\sum_i \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.) A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (It motivated work on maximum entropy models in text learning and on Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here. In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family.
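To make the theorem concrete, here is a minimal numerical sketch (my own illustration, not code from the post): it minimizes the convex dual log Z(lambda) - sum_i lambda_i c_i over a finite outcome space and recovers the exponential family distribution. The six-sided-die setup, the variable names, and the scipy-based solver are all illustrative assumptions.

```python
# Hedged sketch (not from the post): recover the Gibbs-Jaynes maximum entropy
# distribution on a finite outcome space by minimizing the convex dual
#     g(lambda) = log Z(lambda) - sum_i lambda_i c_i,
# whose minimizer yields p(x) = exp(sum_i lambda_i f_i(x)) / Z(lambda).
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Illustrative setup: a six-sided die, one feature f(x) = x, target mean 3.5.
outcomes = np.arange(1.0, 7.0)           # support {1, ..., 6}
features = outcomes[None, :]             # shape (num_features=1, num_outcomes=6)
c = np.array([3.5])                      # expectation constraints E_p[f_i] = c_i

def dual(lam):
    scores = lam @ features              # sum_i lambda_i f_i(x) for each outcome x
    return logsumexp(scores) - lam @ c   # log Z(lambda) - lambda . c

lam = minimize(dual, x0=np.zeros(len(c))).x
p = np.exp(lam @ features)
p /= p.sum()                             # the exponential family distribution

print(np.round(p, 3))                    # uniform (1/6 each), since 3.5 is the fair mean
print(p @ outcomes)                      # constraint met: ~3.5
```

With c = 3.5 the optimizer returns lambda ≈ 0 and the uniform distribution, exactly as maximum entropy demands; any other target mean tilts p into a non-uniform member of the exponential family.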


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) [sent-1, score-0.816]

2 subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. [sent-2, score-0.774]

3 A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. [sent-7, score-0.402]

4 The result has been demonstrated a number of ways. [sent-9, score-0.14]

5 In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. [sent-11, score-1.768]

6 It’s a surprising connection and the duality it flows from appears in a wide variety of work. [sent-12, score-0.465]

7 (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.) [sent-13, score-0.133]

8 The traditional trick is to pull a sleight of hand in the derivation. [sent-15, score-0.284]

9 We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. [sent-16, score-0.688]

10 This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. [sent-18, score-0.452]

11 So I realized a few months back that the primal (entropy) problem that regularization relates to is remarkably natural. [sent-20, score-0.547]

12 Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. [sent-21, score-0.841]

13 What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. [sent-22, score-0.676]

14 Miroslav Dudík, Steven Phillips, and Robert E. Schapire have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. [sent-26, score-0.277]

15 It’s a great paper and I hope you get a chance to check it out: Performance guarantees for regularized maximum entropy density estimation. [sent-27, score-0.855]

16 It turns out the idea generalizes quite a bit. [sent-29, score-0.134]

17 Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. [sent-36, score-0.586]

18 Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. [sent-38, score-1.47]

19 Unifying Divergence Minimization and Statistical Inference via Convex Duality. The deep, unifying result seems to be what the title of the post says: robustness = regularization. [sent-40, score-0.367]

20 This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. [sent-41, score-0.316]
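The duality described in sentences 9–13 can be written out explicitly. What follows is a sketch in the standard form of the result (the form proved in the Dudík–Phillips–Schapire paper cited above), not a derivation lifted from the post: box uncertainty of width beta_i about each moment constraint in the primal surfaces as an l_1 penalty with coefficients beta_i in the dual.

```latex
% Robust primal: maximize entropy under uncertain moment constraints
\max_{p \in \Delta}\; H(p)
\quad\text{s.t.}\quad \bigl|\,\mathbb{E}_{p}[f_i] - c_i\,\bigr| \le \beta_i
\quad \forall i
% Its Lagrangian dual is l1-regularized maximum likelihood over the family
% p_\lambda(x) = e^{\sum_i \lambda_i f_i(x)} / Z(\lambda):
\min_{\lambda}\; \log Z(\lambda) \;-\; \sum_i \lambda_i c_i
\;+\; \sum_i \beta_i\,\lvert\lambda_i\rvert
```

Setting every beta_i = 0 collapses this back to the unregularized maxent = maximum likelihood duality of the introduction: the regularizer is not tacked on in the dual but inherited from the primal’s uncertainty set, which is the point of the post’s title.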


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('entropy', 0.333), ('maximum', 0.213), ('duality', 0.202), ('regularization', 0.19), ('constraints', 0.19), ('primal', 0.187), ('dual', 0.168), ('flows', 0.151), ('regularized', 0.151), ('sleight', 0.151), ('distribution', 0.141), ('result', 0.14), ('tells', 0.136), ('generalizes', 0.134), ('unifying', 0.134), ('re', 0.133), ('inference', 0.133), ('expectation', 0.128), ('goes', 0.126), ('likelihood', 0.125), ('whole', 0.125), ('motivating', 0.118), ('statistical', 0.117), ('variety', 0.112), ('realized', 0.108), ('robustness', 0.093), ('schapire', 0.091), ('uncertainty', 0.091), ('features', 0.085), ('exponential', 0.082), ('check', 0.079), ('guarantees', 0.079), ('robust', 0.077), ('show', 0.072), ('hand', 0.071), ('dud', 0.067), ('fruitful', 0.067), ('hack', 0.067), ('miroslav', 0.067), ('proven', 0.067), ('steven', 0.067), ('us', 0.066), ('subject', 0.063), ('maximizes', 0.062), ('penalty', 0.062), ('relates', 0.062), ('jordan', 0.062), ('wainwright', 0.062), ('begins', 0.062), ('pull', 0.062)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000006 185 hunch net-2006-06-16-Regularization = Robustness


2 0.16051762 308 hunch net-2008-07-06-To Dual or Not

Introduction: Yoram and Shai’s online learning tutorial at ICML brings up a question for me, “Why use the dual?” The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function. The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_ji. The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful. Noise: When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful

3 0.13429154 192 hunch net-2006-07-08-Some recent papers

Introduction: It was a fine time for learning in Pittsburgh. John and Sam mentioned some of my favorites. Here’s a few more worth checking out: Online Multitask Learning by Ofer Dekel, Phil Long, and Yoram Singer. This is on my reading list. Definitely an area I’m interested in. Maximum Entropy Distribution Estimation with Generalized Regularization by Miroslav Dudík and Robert E. Schapire. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path by András Antos, Csaba Szepesvári, and Rémi Munos. Again, on the list to read. I saw Csaba and Remi talk about this and related work at an ICML Workshop on Kernel Reinforcement Learning. The big question in my head is how this compares/contrasts with existing work in reductions to reinforcement learning. Are there advantages/disadvantages? Higher Order Learning On Graphs by Sameer Agarwal, Kristin Branson, and Serge Belongie looks to be interesting. They seem to poo-poo “tensorization

4 0.131466 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

Introduction: A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short I was thinking that “bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict! Example (from Kass 1996): consider a Die (6 sides), consider prior knowledge E[X]=3.5. Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Now consider a new piece of evidence A=”X is an odd number”. Bayesian posterior P(X|A) ∝ P(A|X) P(X) = (1/3, 0, 1/3, 0, 1/3, 0). But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0)!! (a numeric check of these figures appears after this list) (note that E[Indicator function of A]=P(A)) Indeed, for MaxEnt, because there is no more ‘6′, big numbers must be more probable to ensure an average of 3.5. For bayesian updating, P(X|A) doesn’t have to have a 3.5

5 0.1167996 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally. Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l_1 or l_2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint. Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) ≤ e(c,S) + f(complexity), where complexity is some measure of the size of a s

6 0.093075305 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

7 0.090931416 347 hunch net-2009-03-26-Machine Learning is too easy

8 0.086432889 352 hunch net-2009-05-06-Machine Learning to AI

9 0.08601702 258 hunch net-2007-08-12-Exponentiated Gradient

10 0.082526043 183 hunch net-2006-06-14-Explorations of Exploration

11 0.082482964 45 hunch net-2005-03-22-Active learning

12 0.080424413 235 hunch net-2007-03-03-All Models of Learning have Flaws

13 0.07402698 420 hunch net-2010-12-26-NIPS 2010

14 0.072430156 131 hunch net-2005-11-16-The Everything Ensemble Edge

15 0.071525194 34 hunch net-2005-03-02-Prior, “Prior” and Bias

16 0.068716422 77 hunch net-2005-05-29-Maximum Margin Mismatch?

17 0.067955591 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

18 0.067883343 332 hunch net-2008-12-23-Use of Learning Theory

19 0.067444652 456 hunch net-2012-02-24-ICML+50%

20 0.066904083 189 hunch net-2006-07-05-more icml papers
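As a quick numeric check of the die example in entry 4 above (my own sketch, not from either post, reusing the dual solver from the introduction; the variable names are illustrative): maximum entropy over the odd faces {1, 3, 5} with mean constraint 3.5 reproduces the quoted figures.

```python
# Hedged check (not from the post) of the MaxEnt numbers quoted in entry 4:
# maxent over the odd faces {1, 3, 5} with E[X] = 3.5, solved via the dual.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

odd = np.array([1.0, 3.0, 5.0])

def dual(lam):
    return logsumexp(lam[0] * odd) - lam[0] * 3.5

lam = minimize(dual, x0=np.zeros(1)).x
p = np.exp(lam[0] * odd)
p /= p.sum()
print(np.round(p, 2))   # ~[0.22 0.32 0.47], matching (.22, .32, .47) up to rounding
```

Because the face 6 is no longer available, probability mass shifts toward the larger odd faces to keep the average at 3.5, which is exactly the intuition the excerpt describes.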


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.174), (1, 0.05), (2, 0.008), (3, -0.009), (4, 0.054), (5, -0.013), (6, -0.043), (7, -0.013), (8, 0.029), (9, -0.039), (10, 0.017), (11, 0.032), (12, -0.053), (13, -0.044), (14, 0.069), (15, -0.028), (16, -0.075), (17, 0.007), (18, 0.036), (19, 0.026), (20, -0.02), (21, 0.048), (22, -0.03), (23, -0.019), (24, 0.062), (25, 0.039), (26, -0.001), (27, -0.012), (28, 0.048), (29, -0.003), (30, 0.012), (31, -0.099), (32, -0.027), (33, -0.01), (34, -0.045), (35, -0.029), (36, 0.004), (37, 0.109), (38, 0.03), (39, 0.052), (40, -0.034), (41, 0.01), (42, 0.023), (43, -0.043), (44, -0.046), (45, -0.04), (46, -0.062), (47, 0.069), (48, -0.015), (49, 0.077)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96328121 185 hunch net-2006-06-16-Regularization = Robustness


2 0.65014309 77 hunch net-2005-05-29-Maximum Margin Mismatch?

Introduction: John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003. There are some things I find odd about the paper. For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” I’m aware of no such limitations. Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different abo

3 0.59948742 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

Introduction: Following up on Hal Daume’s post and John’s post on cool and interesting things seen at NIPS, I’ll post my own little list of neat papers here as well. Of course it’s going to be biased towards what I think is interesting. Also, I have to say that I hadn’t been able to see many papers this year at NIPS due to being too busy, so please feel free to contribute the papers that you liked. 1. P. Mudigonda, V. Kolmogorov, P. Torr. An Analysis of Convex Relaxations for MAP Estimation. A surprising paper which shows that many of the more sophisticated convex relaxations proposed recently turn out to be subsumed by the simplest LP relaxation. Be careful next time you try a cool new convex relaxation! 2. D. Sontag, T. Jaakkola. New Outer Bounds on the Marginal Polytope. The title says it all. The marginal polytope is the set of local marginal distributions over subsets of variables that are globally consistent in the sense that there is at least one distributio

4 0.53784543 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature_i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Prod_i Pr’(feature_i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is

5 0.53742653 123 hunch net-2005-10-16-Complexity: It’s all in your head

Introduction: One of the central concerns of learning is to understand and to prevent overfitting. Various notions of “function complexity” often arise: VC dimension, Rademacher complexity, comparison classes of experts, and program length are just a few. The term “complexity” to me seems somehow misleading; the terms never capture something that meets my intuitive notion of complexity. The Bayesian notion clearly captures what’s going on. Functions aren’t “complex”– they’re just “surprising”: we assign to them low probability. Most (all?) complexity notions I know boil down to some (generally loose) bound on the prior probability of the function. In a sense, “complexity” fundamentally arises because probability distributions must sum to one. You can’t believe in all possibilities at the same time, or at least not equally. Rather you have to carefully spread the probability mass over the options you’d like to consider. Large complexity classes mean that beliefs are spread thinly. In

6 0.53665692 192 hunch net-2006-07-08-Some recent papers

7 0.53663081 139 hunch net-2005-12-11-More NIPS Papers

8 0.53622204 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

9 0.5190419 140 hunch net-2005-12-14-More NIPS Papers II

10 0.51902509 189 hunch net-2006-07-05-more icml papers

11 0.51150888 160 hunch net-2006-03-02-Why do people count for learning?

12 0.50578409 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting

13 0.49432305 33 hunch net-2005-02-28-Regularization

14 0.49427006 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

15 0.49401015 199 hunch net-2006-07-26-Two more UAI papers of interest

16 0.49395961 440 hunch net-2011-08-06-Interesting thing at UAI 2011

17 0.49330398 308 hunch net-2008-07-06-To Dual or Not

18 0.49272552 144 hunch net-2005-12-28-Yet more nips thoughts

19 0.48242927 238 hunch net-2007-04-13-What to do with an unreasonable conditional accept

20 0.4777528 448 hunch net-2011-10-24-2011 ML symposium and the bears


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.023), (10, 0.019), (27, 0.183), (38, 0.027), (50, 0.322), (53, 0.057), (55, 0.082), (67, 0.022), (77, 0.042), (94, 0.053), (95, 0.083)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.8678751 185 hunch net-2006-06-16-Regularization = Robustness


2 0.56812477 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

Introduction: Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU . This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous. This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers. For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through th

3 0.56780124 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

Introduction: How do you create an optimal environment for research? Here are some essential ingredients that I see. Stability. University-based research is relatively good at this. On any particular day, researchers face choices in what they will work on. A very common tradeoff is between: ‘easy small’ and ‘difficult big’. For researchers without stability, the ‘easy small’ option wins. This is often “ok”—a series of incremental improvements on the state of the art can add up to something very beneficial. However, it misses one of the big potentials of research: finding entirely new and better ways of doing things. Stability comes in many forms. The prototypical example is tenure at a university—a tenured professor is almost impossible to fire which means that the professor has the freedom to consider far horizon activities. An iron-clad guarantee of a paycheck is not necessary—industrial research labs have succeeded well with research positions of indefinite duration. AT&T rese

4 0.56723928 36 hunch net-2005-03-05-Funding Research

Introduction: The funding of research (and machine learning research) is an issue which seems to have become more significant in the United States over the last decade. The word “research” is applied broadly here to science, mathematics, and engineering. There are two essential difficulties with funding research: Longshot: Paying a researcher is often a big gamble. Most research projects don’t pan out, but a few big payoffs can make it all worthwhile. Information Only: Much of research is about finding the right way to think about or do something. The Longshot difficulty means that there is high variance in payoffs. This can be compensated for by funding many different research projects, reducing variance. The Information-Only difficulty means that it’s hard to extract a profit directly from many types of research, so companies have difficulty justifying basic research. (Patents are a mechanism for doing this. They are often extraordinarily clumsy or simply not applicable.) T

5 0.56631839 360 hunch net-2009-06-15-In Active Learning, the question changes

Introduction: A little over 4 years ago, Sanjoy made a post saying roughly “we should study active learning theoretically, because not much is understood”. At the time, we did not understand basic things such as whether or not it was possible to PAC-learn with an active algorithm without making strong assumptions about the noise rate. In other words, the fundamental question was “can we do it?” The nature of the question has fundamentally changed in my mind. The answer to the previous question is “yes”, both information theoretically and computationally, most places where supervised learning could be applied. In many situations, the question has now changed to: “is it worth it?” Is the programming and computational overhead low enough to make the label cost savings of active learning worthwhile? Currently, there are situations where this question could go either way. Much of the challenge for the future is in figuring out how to make active learning easier or more worthwhile.

6 0.56582224 343 hunch net-2009-02-18-Decision by Vetocracy

7 0.56463271 235 hunch net-2007-03-03-All Models of Learning have Flaws

8 0.56398147 194 hunch net-2006-07-11-New Models

9 0.56331211 225 hunch net-2007-01-02-Retrospective

10 0.56223643 220 hunch net-2006-11-27-Continuizing Solutions

11 0.56167608 466 hunch net-2012-06-05-ICML acceptance statistics

12 0.56017035 406 hunch net-2010-08-22-KDD 2010

13 0.55961418 12 hunch net-2005-02-03-Learning Theory, by assumption

14 0.55943847 464 hunch net-2012-05-03-Microsoft Research, New York City

15 0.5591647 370 hunch net-2009-09-18-Necessary and Sufficient Research

16 0.55743253 134 hunch net-2005-12-01-The Webscience Future

17 0.55680788 388 hunch net-2010-01-24-Specializations of the Master Problem

18 0.55554938 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

19 0.55460536 351 hunch net-2009-05-02-Wielding a New Abstraction

20 0.55440432 371 hunch net-2009-09-21-Netflix finishes (and starts)