hunch_net hunch_net-2005 hunch_net-2005-23 knowledge-graph by maker-knowledge-mining

23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models


meta info for this blog

Source: html

Introduction: This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. I found this paper very difficult to read, but it does make a point about a computational shortcut. This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else? All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims that such models can be optimized for classification purposes using only the correct y and the other y’ (not equal to y) which maximizes f(x,y). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experimental results and intuition.
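To make the shortcut concrete, here is a minimal sketch (not the paper's code), assuming a linear parameterization f(x,y) = w_y · x over k classes. The log-loss gradient must touch every y through Z(x), while the contrastive gradient only uses the correct y and the highest-scoring incorrect y’; the margin form below is one member of the family of losses the paper studies, not necessarily the loss used in its experiments.

```python
# Minimal sketch (illustrative, not the paper's code): linear model
# f(x, y) = w_y . x over k classes, parameters W of shape (k, d).
import numpy as np

def log_loss_grad(W, x, y):
    """Gradient of -log p(y|x) where p(y|x) = exp(w_y.x) / Z(x).
    Normalizing by Z(x) touches every one of the k classes."""
    scores = W @ x                        # f(x, y') for all y'
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # Z(x): a sum over all classes
    grad = np.outer(p, x)                 # expectation term coming from Z(x)
    grad[y] -= x                          # term for the correct label
    return grad

def contrastive_grad(W, x, y, margin=1.0):
    """Shortcut: only the correct y and the best wrong y' contribute."""
    scores = W @ x
    rivals = scores.copy()
    rivals[y] = -np.inf
    y_prime = int(np.argmax(rivals))      # (possibly approximate) maximizer
    grad = np.zeros_like(W)
    if scores[y] - scores[y_prime] < margin:   # hinge-style condition
        grad[y] -= x                      # lower the energy of the correct y
        grad[y_prime] += x                # raise the energy of the rival y'
    return grad

W, x = np.zeros((3, 5)), np.ones(5)
print(log_loss_grad(W, x, 0))             # dense in every row (every y)
print(contrastive_grad(W, x, 0))          # nonzero only for y and y'
```

Since contrastive_grad never forms Z(x), it still makes sense for unnormalizable models, and an approximate argmax can stand in for the exact one, which is the paper's claim.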


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This is a paper by Yann LeCun and Fu Jie Huang published at AISTATS 2005. [sent-1, score-0.356]

2 I found this paper very difficult to read, but it does make a point about a computational shortcut. [sent-2, score-0.373]

3 This paper takes for granted that the method of solving a problem is gradient descent on parameters. [sent-3, score-0.691]

4 Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else? [sent-4, score-0.576]

5 All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sum_y f(x,y) (the paper calls -log f(x,y) an “energy”). [sent-5, score-0.747]

6 If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. [sent-6, score-0.335]

7 The paper claims that such models can be optimized for classification purposes using only the correct y and the other y’ (not equal to y) which maximizes f(x,y). [sent-7, score-0.65]

8 The paper further claims that this can be done with an approximate maximum. [sent-9, score-0.966]

9 These claims are plausible based on experimental results and intuition. [sent-10, score-0.579]

10 It wouldn’t surprise me to learn that ignoring Z(x) (and renormalizing later) is common in fast implementations of some probabilistic model fitting algorithms, but I haven’t seen this previously discussed. [sent-11, score-0.674]

11 The ability to use an approximate maximum y’ seems potentially very useful. [sent-12, score-0.294]

12 With that encouragement, I had significant difficulties with the paper, including the following: Lack of a theorem. [sent-13, score-0.132]

13 A good theorem proving these things work would be quite helpful. [sent-14, score-0.084]

14 It isn’t clear whether the claims are always true, just true on the examples encountered, or true with some small modification. [sent-15, score-0.757]

15 For better or worse, the paper uses the second definition of loss, “Loss is part of the solution”, which I find unnatural. [sent-17, score-0.544]

16 Claims I don’t understand or which aren’t technically true. [sent-18, score-0.118]

17 For example, there is a claim that log-loss is the “only well-justified loss function”. [sent-20, score-0.142]

18 The meaning of well-justified is unclear, and I can think of several meanings where other losses (such as squared error) are well-justified. [sent-21, score-0.184]

19 With the above difficulties, this paper seems lucky to have been accepted. [sent-22, score-0.445]

20 This isn’t a criticism of AISTAT because it also seems plausible that this computational shortcut may eventually help many optimizations. [sent-23, score-0.569]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('claims', 0.457), ('paper', 0.282), ('aistat', 0.19), ('probabilistic', 0.187), ('gradient', 0.169), ('true', 0.15), ('approximate', 0.147), ('loss', 0.142), ('difficulties', 0.132), ('descent', 0.13), ('plausible', 0.122), ('definition', 0.12), ('implementations', 0.118), ('shortcut', 0.118), ('technically', 0.118), ('calls', 0.11), ('encouragement', 0.11), ('granted', 0.11), ('maximizes', 0.11), ('meanings', 0.104), ('ignoring', 0.099), ('optimizations', 0.099), ('models', 0.097), ('parameterized', 0.095), ('lucky', 0.091), ('surprise', 0.091), ('wouldn', 0.091), ('computational', 0.091), ('model', 0.09), ('criticism', 0.089), ('energy', 0.089), ('fitting', 0.089), ('isn', 0.084), ('proving', 0.084), ('purposes', 0.084), ('lecun', 0.082), ('done', 0.08), ('encountered', 0.08), ('losses', 0.08), ('arises', 0.08), ('eventually', 0.077), ('optimized', 0.077), ('yann', 0.075), ('maximum', 0.075), ('published', 0.074), ('seems', 0.072), ('sum', 0.071), ('hence', 0.071), ('main', 0.071), ('none', 0.071)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models


2 0.16683687 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be
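As a small illustration of the step-size issue described above, here is a sketch of plain gradient descent on a toy ill-conditioned quadratic (this is ordinary gradient descent, not SMD; the specific step sizes are illustrative).

```python
# Plain gradient descent on loss(w) = 0.5 * w^T A w with ill-conditioned A.
# Convergence requires step < 2 / max eigenvalue of A, here 2/100 = 0.02.
import numpy as np

def gradient_descent(grad, w0, step, iters=100):
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w -= step * grad(w)
    return w

A = np.diag([1.0, 100.0])          # curvature differs by 100x across axes
grad = lambda w: A @ w

print(gradient_descent(grad, [1.0, 1.0], step=0.019))  # converges, slowly on axis 0
print(gradient_descent(grad, [1.0, 1.0], step=0.021))  # diverges on axis 1
```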

3 0.1516611 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l 2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

4 0.14735511 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve
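The excerpt cuts off before the update, but a standard EG round for squared loss looks roughly like the sketch below; the learning rate and the choice of squared loss are illustrative assumptions, not necessarily the setting of the original analysis.

```python
# One round of Exponentiated Gradient (EG) with squared loss (illustrative).
import numpy as np

def eg_round(w, x, y, eta=0.5):
    """Predict with y' = w . x, observe y, then update multiplicatively."""
    y_hat = w @ x                          # prediction (dot product)
    grad = 2.0 * (y_hat - y) * x           # d/dw of (y_hat - y)^2
    w = w * np.exp(-eta * grad)            # multiplicative (exponentiated) update
    return w / w.sum()                     # renormalize to a probability vector

n = 4
w = np.full(n, 1.0 / n)                    # initial weights (1/n, ..., 1/n)
for x, y in [(np.array([1., 0., 1., 0.]), 1.0), (np.array([0., 1., 1., 1.]), 0.0)]:
    w = eg_round(w, x, y)
print(w)
```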

5 0.14574024 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) - 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
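Here is a minimal side-by-side of the losses named above for one binary example; the small clipping constant that confines the log-loss prediction away from 0 and 1 is an illustrative choice.

```python
# 0-1 loss and two proxy losses for a single binary example (illustrative).
import math

def zero_one_loss(y, p):             # y in {0, 1}, p = predicted Pr(y = 1)
    return float(y != (p > 0.5))     # threshold the prediction at 0.5

def squared_loss(y, p):
    return (y - p) ** 2

def log_loss(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # confine the prediction to (0, 1)
    return -math.log(p if y == 1 else 1.0 - p)   # log 1/p'(y|x)

y, p = 1, 0.8
print(zero_one_loss(y, p), squared_loss(y, p), log_loss(y, p))
```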

6 0.14570999 245 hunch net-2007-05-12-Loss Function Semantics

7 0.14548792 275 hunch net-2007-11-29-The Netflix Crack

8 0.13608508 274 hunch net-2007-11-28-Computational Consequences of Classification

9 0.13390207 330 hunch net-2008-12-07-A NIPS paper

10 0.13217556 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.12427854 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

12 0.12297229 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

13 0.12010767 440 hunch net-2011-08-06-Interesting thing at UAI 2011

14 0.11576459 259 hunch net-2007-08-19-Choice of Metrics

15 0.11231641 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

16 0.10642394 38 hunch net-2005-03-09-Bad Reviewing

17 0.10550538 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009

18 0.10417209 304 hunch net-2008-06-27-Reviewing Horror Stories

19 0.10332897 343 hunch net-2009-02-18-Decision by Vetocracy

20 0.10248584 97 hunch net-2005-07-23-Interesting papers at ACL


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.244), (1, 0.084), (2, 0.116), (3, -0.036), (4, -0.066), (5, 0.099), (6, -0.104), (7, -0.009), (8, 0.009), (9, 0.012), (10, 0.027), (11, -0.055), (12, -0.105), (13, -0.059), (14, 0.022), (15, 0.04), (16, 0.011), (17, 0.086), (18, 0.113), (19, -0.072), (20, -0.066), (21, 0.111), (22, -0.076), (23, -0.007), (24, -0.026), (25, 0.029), (26, -0.047), (27, -0.009), (28, -0.023), (29, -0.094), (30, -0.087), (31, 0.025), (32, -0.182), (33, -0.005), (34, -0.022), (35, -0.045), (36, -0.058), (37, -0.01), (38, 0.055), (39, 0.018), (40, -0.061), (41, -0.109), (42, -0.045), (43, 0.083), (44, 0.009), (45, -0.095), (46, -0.01), (47, -0.056), (48, 0.068), (49, 0.02)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98182333 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models


2 0.62462157 440 hunch net-2011-08-06-Interesting thing at UAI 2011

Introduction: I had a chance to attend UAI this year, where several papers interested me, including: Hoifung Poon and Pedro Domingos, Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it. Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 orders of magnitude speedups, as they show experimentally. Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous

3 0.60229725 330 hunch net-2008-12-07-A NIPS paper

Introduction: I’m skipping NIPS this year in favor of Ada , but I wanted to point out this paper by Andriy Mnih and Geoff Hinton . The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches. I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression. There are a couple workarounds for this computational bug: Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application. Avoid. Y
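To illustrate the normalization bug described above, the sketch below contrasts a flat softmax, which must score every event just to normalize a single probability, with a balanced binary tree over events, which multiplies one sigmoid decision per level along a root-to-leaf path; the tree parameterization here is illustrative and is not the Mnih and Hinton model.

```python
# Flat normalization is O(n) per probability; a balanced tree is O(log n).
import numpy as np

def softmax_prob(scores, i):
    """Flat model: all n scores are needed just to compute Z."""
    e = np.exp(scores - scores.max())
    return e[i] / e.sum()

def tree_prob(path_scores, leaf, depth):
    """Tree model: path_scores[d] is the score of the internal node visited
    at depth d on the way to `leaf`; one sigmoid per level."""
    p = 1.0
    for d in range(depth):
        go_right = (leaf >> (depth - 1 - d)) & 1       # d-th bit of the leaf index
        s = 1.0 / (1.0 + np.exp(-path_scores[d]))      # Pr(go right | node)
        p *= s if go_right else (1.0 - s)
    return p

n, depth = 1024, 10                                    # n = 2**depth leaves
print(softmax_prob(np.random.randn(n), 7))             # touches all n scores
print(tree_prob(np.random.randn(depth), 7, depth))     # touches only depth scores
```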

4 0.5882287 77 hunch net-2005-05-29-Maximum Margin Mismatch?

Introduction: John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003. There are some things I find odd about the paper. For instance, it says of probabilistic models “cannot handle high dimensional feature spaces and lack strong theoretical guarantees.” I’m aware of no such limitations. Also: “Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.” This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different abo

5 0.57216012 199 hunch net-2006-07-26-Two more UAI papers of interest

Introduction: In addition to Ed Snelson’s paper, there were (at least) two other papers that caught my eye at UAI. One was this paper by Sanjoy Dasgupta, Daniel Hsu and Nakul Verma at UCSD which shows in a surprisingly general and strong way that almost all linear projections of any jointly distributed vector random variable with finite first and second moments look spherical and unimodal (in fact look like a scale mixture of Gaussians). Great result, as you’d expect from Sanjoy. The other paper which I found intriguing but which I just haven’t grokked yet is this beast by Manfred and Dima Kuzmin. You can check out the (beautiful) slides if that helps. I feel like there is something deep here, but my brain is too small to understand it. The COLT and last NIPS papers/slides are also on Manfred’s page. Hopefully someone here can illuminate.

6 0.55128008 258 hunch net-2007-08-12-Exponentiated Gradient

7 0.54596615 256 hunch net-2007-07-20-Motivation should be the Responsibility of the Reviewer

8 0.54212368 420 hunch net-2010-12-26-NIPS 2010

9 0.53667611 52 hunch net-2005-04-04-Grounds for Rejection

10 0.53186548 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006

11 0.52738816 167 hunch net-2006-03-27-Gradients everywhere

12 0.51937503 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

13 0.51753789 97 hunch net-2005-07-23-Interesting papers at ACL

14 0.51660812 439 hunch net-2011-08-01-Interesting papers at COLT 2011

15 0.50842261 111 hunch net-2005-09-12-Fast Gradient Descent

16 0.50841057 188 hunch net-2006-06-30-ICML papers

17 0.50765687 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

18 0.49588451 177 hunch net-2006-05-05-An ICML reject

19 0.49408752 280 hunch net-2007-12-20-Cool and Interesting things at NIPS, take three

20 0.49234474 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.016), (10, 0.018), (27, 0.29), (38, 0.013), (49, 0.301), (53, 0.048), (55, 0.094), (94, 0.075), (95, 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97661257 122 hunch net-2005-10-13-Site tweak

Introduction: Several people have had difficulty with comments which seem to have an allowed language significantly poorer than posts. The set of allowed html tags has been increased and the markdown filter has been put in place to try to make commenting easier. I’ll put some examples into the comments of this post.

2 0.96372175 224 hunch net-2006-12-12-Interesting Papers at NIPS 2006

Introduction: Here are some papers that I found surprisingly interesting. Yoshua Bengio , Pascal Lamblin, Dan Popovici, Hugo Larochelle, Greedy Layer-wise Training of Deep Networks . Empirically investigates some of the design choices behind deep belief networks. Long Zhu , Yuanhao Chen, Alan Yuille Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing. An unsupervised method for detecting objects using simple feature filters that works remarkably well on the (supervised) caltech-101 dataset . Shai Ben-David , John Blitzer , Koby Crammer , and Fernando Pereira , Analysis of Representations for Domain Adaptation . This is the first analysis I’ve seen of learning with respect to samples drawn differently from the evaluation distribution which depends on reasonable measurable quantities. All of these papers turn out to have a common theme—the power of unlabeled data to do generically useful things.

same-blog 3 0.92426759 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models


4 0.91955167 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project

Introduction: Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github. For example, the latest and greatest can be downloaded via: git clone git://github.com/JohnLangford/vowpal_wabbit.git If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation. This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes. As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience. The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10^12 sparse feature) size datasets. The limiting factor is typically i/o. I started using the real datasets from the large-scale learning workshop as a conve

5 0.91944653 37 hunch net-2005-03-08-Fast Physics for Learning

Introduction: While everyone is silently working on ICML submissions, I found this discussion about a fast physics simulator chip interesting from a learning viewpoint. In many cases, learning attempts to predict the outcome of physical processes. Access to a fast simulator for these processes might be quite helpful in predicting the outcome. Bayesian learning in particular may directly benefit while many other algorithms (like support vector machines) might have their speed greatly increased. The biggest drawback is that writing software for these odd architectures is always difficult and time consuming, but a several-orders-of-magnitude speedup might make that worthwhile.

6 0.90404159 348 hunch net-2009-04-02-Asymmophobia

7 0.83201522 338 hunch net-2009-01-23-An Active Learning Survey

8 0.80876815 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

9 0.75225306 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

10 0.73253769 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

11 0.73198521 194 hunch net-2006-07-11-New Models

12 0.73185581 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

13 0.72515345 360 hunch net-2009-06-15-In Active Learning, the question changes

14 0.72062463 493 hunch net-2014-02-16-Metacademy: a package manager for knowledge

15 0.72018844 220 hunch net-2006-11-27-Continuizing Solutions

16 0.71955091 5 hunch net-2005-01-26-Watchword: Probability

17 0.71665221 343 hunch net-2009-02-18-Decision by Vetocracy

18 0.71637243 304 hunch net-2008-06-27-Reviewing Horror Stories

19 0.71370018 225 hunch net-2007-01-02-Retrospective

20 0.7124384 347 hunch net-2009-03-26-Machine Learning is too easy