
245 hunch net-2007-05-12-Loss Function Semantics


meta info for this blog

Source: html

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss $l_{sq}(y,y') = (y-y')^2$ means predicting the (conditional) mean of $y$. Optimizing absolute value loss $l_{av}(y,y') = |y-y'|$ means predicting the (conditional) median of $y$. Variants can handle other quantiles. 0/1 loss for classification is a special case. Optimizing log loss $l_{log}(y,y') = \log(1/\Pr_{z \sim y'}(z=y))$ means minimizing the description length of $y$. The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: for all distributions $D$ over $Y$, if $y' = \arg\min_{y'} E_{y \sim D}\, l_{sq}(y,y')$ then $y' = E_{y \sim D}\, y$. Similar theorems hold for the other examples above, and they can all be extended to predictors of $y'$ for distributions $D$ over a context $X$ and a value $Y$. There are 3 points to this post. Everyone doing general machine learning …
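As a quick concreteness check on these semantics (my sketch, not from the original post; it assumes numpy is available), the constant prediction minimizing average squared loss on samples from a skewed distribution lands on the sample mean, the minimizer of average absolute loss lands on the sample median, and the probability minimizing expected log loss for a Bernoulli variable is the true probability:

import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=100_000)  # skewed: mean ~ 1.0, median ~ ln 2

# Evaluate each candidate constant prediction c on a grid.
candidates = np.linspace(0.0, 5.0, 2001)
sq_risk = [np.mean((y - c) ** 2) for c in candidates]    # average squared loss
av_risk = [np.mean(np.abs(y - c)) for c in candidates]   # average absolute loss

print("argmin squared loss:", candidates[np.argmin(sq_risk)], "vs mean:", y.mean())
print("argmin absolute loss:", candidates[np.argmin(av_risk)], "vs median:", np.median(y))

# Log loss: for y ~ Bernoulli(0.3), the predicted probability q minimizing
# expected log loss 0.3*log(1/q) + 0.7*log(1/(1-q)) is the true probability 0.3.
p = 0.3
qs = np.linspace(0.001, 0.999, 999)
log_risk = -(p * np.log(qs) + (1 - p) * np.log(1 - qs))
print("argmin log loss:", qs[np.argmin(log_risk)])  # ~ 0.3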


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. [sent-1, score-1.276]

2 Optimizing squared loss $l_{sq}(y,y') = (y-y')^2$ means predicting the (conditional) mean of $y$. [sent-2, score-1.283]

3 Optimizing absolute value loss $l_{av}(y,y') = |y-y'|$ means predicting the (conditional) median of $y$. [sent-3, score-1.14]

4 Optimizing log loss $l_{log}(y,y') = \log(1/\Pr_{z \sim y'}(z=y))$ means minimizing the description length of $y$. [sent-6, score-1.153]

5 The semantics (= meaning) of the loss are made explicit by a theorem in each case. [sent-7, score-1.009]

6 Everyone doing general machine learning should be aware of the laundry list above. [sent-10, score-0.298]

7 They form a handy toolkit which can match many of the problems naturally encountered. [sent-11, score-0.32]

8 People also try to optimize a variety of other loss functions. [sent-12, score-0.686]

9 Some of these are (effectively) a special case of the above. [sent-13, score-0.206]

10 For example, “hinge loss” is absolute value loss when the hinge point is at the upper range. [sent-14, score-1.065]

11 Some of the other losses do not have any known semantics. [sent-15, score-0.07]

12 In this case, discovering a semantics could be quite valuable. [sent-16, score-0.384]

13 The natural direction when thinking about how to solve a problem is to start with the semantics you want and then derive a loss. [sent-17, score-0.386]

14 I don’t know of any general way to do this other than simply applying the laundry list above. [sent-18, score-0.368]

15 As one example, what is a loss function for estimating the mean of a random variable y over the 5th to 95th quantile? [sent-19, score-0.874]

16 (How do we do squared error regression which is insensitive to outliers?) [sent-20, score-0.184]
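The introduction's remark that "variants can handle other quantiles" refers to the tilted absolute value (pinball) loss $l_\tau(y,y') = \tau(y-y')$ if $y \ge y'$ and $(1-\tau)(y'-y)$ otherwise, whose minimizer is the $\tau$-th quantile. A minimal sketch (mine, assuming numpy; it illustrates the quantile variant rather than answering the open question in sentences 15-16):

import numpy as np

def pinball(y, c, tau):
    # Average pinball loss of the constant prediction c at quantile level tau.
    d = y - c
    return np.mean(np.where(d >= 0, tau * d, (tau - 1) * d))

rng = np.random.default_rng(1)
y = rng.normal(size=100_000)

candidates = np.linspace(-3.0, 3.0, 1201)
for tau in (0.05, 0.5, 0.95):
    best = candidates[np.argmin([pinball(y, c, tau) for c in candidates])]
    print(f"tau={tau}: argmin ~ {best:.3f}, empirical quantile = {np.quantile(y, tau):.3f}")

At tau = 0.5 this recovers absolute value loss (up to a factor of 2) and hence the median.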


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('loss', 0.538), ('semantics', 0.311), ('sq', 0.234), ('laundry', 0.208), ('optimizing', 0.187), ('squared', 0.184), ('absolute', 0.181), ('hinge', 0.173), ('log', 0.142), ('special', 0.132), ('distributions', 0.129), ('meaning', 0.121), ('conditional', 0.117), ('mean', 0.112), ('means', 0.112), ('value', 0.11), ('arg', 0.104), ('toolkit', 0.104), ('predicting', 0.103), ('theorem', 0.1), ('median', 0.096), ('quantile', 0.091), ('list', 0.09), ('variety', 0.087), ('estimating', 0.08), ('extended', 0.08), ('function', 0.079), ('length', 0.078), ('match', 0.078), ('minimizing', 0.078), ('derive', 0.075), ('min', 0.075), ('case', 0.074), ('handy', 0.073), ('discovering', 0.073), ('losses', 0.07), ('applying', 0.07), ('variants', 0.07), ('hold', 0.067), ('form', 0.065), ('variable', 0.065), ('description', 0.063), ('manner', 0.063), ('upper', 0.063), ('handle', 0.062), ('optimize', 0.061), ('explicit', 0.06), ('theorems', 0.06), ('predictors', 0.058), ('independent', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 245 hunch net-2007-05-12-Loss Function Semantics


2 0.57299441 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector $w$ and predict according to a dot product $f_w(x) = \sum_i w_i x_i$, where optimizing squared loss $(y - f_w(x))^2$ over many samples is much more tractable than optimizing 0-1 loss $I(y \neq \mathrm{Threshold}(f_w(x) - 0.5))$. While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to $[0,1]$, we can treat it as a predicted probability that the label is 1, and measure loss according to $\log 1/p'(y|x)$ where $p'(y|x)$ is the predicted probability of the observed label. A standard method for confining …

3 0.48225033 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “$l_2$ regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs) …

4 0.36889857 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that solves the problem” …

5 0.3286587 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate , I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

6 0.31551817 259 hunch net-2007-08-19-Choice of Metrics

7 0.21577363 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

8 0.20601578 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

9 0.20448059 374 hunch net-2009-10-10-ALT 2009

10 0.19429749 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

11 0.17718923 129 hunch net-2005-11-07-Prediction Competitions

12 0.16010006 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

13 0.15337592 371 hunch net-2009-09-21-Netflix finishes (and starts)

14 0.14570999 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

15 0.13970149 103 hunch net-2005-08-18-SVM Adaptability

16 0.13761508 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

17 0.12262534 67 hunch net-2005-05-06-Don’t mix the solution into the problem

18 0.12189317 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

19 0.12109784 420 hunch net-2010-12-26-NIPS 2010

20 0.11919262 258 hunch net-2007-08-12-Exponentiated Gradient


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.178), (1, 0.227), (2, 0.19), (3, -0.238), (4, -0.454), (5, 0.211), (6, -0.191), (7, 0.022), (8, 0.069), (9, 0.032), (10, 0.129), (11, -0.098), (12, -0.057), (13, 0.071), (14, -0.055), (15, 0.008), (16, -0.006), (17, 0.008), (18, -0.038), (19, 0.041), (20, 0.071), (21, 0.022), (22, -0.063), (23, -0.016), (24, 0.008), (25, 0.019), (26, -0.014), (27, -0.004), (28, -0.025), (29, 0.004), (30, -0.025), (31, 0.008), (32, 0.006), (33, 0.037), (34, 0.017), (35, -0.019), (36, 0.011), (37, 0.009), (38, 0.01), (39, -0.04), (40, -0.013), (41, 0.032), (42, -0.012), (43, -0.028), (44, -0.019), (45, 0.005), (46, 0.017), (47, 0.051), (48, 0.047), (49, -0.037)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99735111 245 hunch net-2007-05-12-Loss Function Semantics


2 0.97803974 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


3 0.97165537 9 hunch net-2005-02-01-Watchword: Loss


4 0.90369725 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


5 0.86192101 259 hunch net-2007-08-19-Choice of Metrics

Introduction: How do we judge success in Machine Learning? As Aaron notes, the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real world …

6 0.85609126 274 hunch net-2007-11-28-Computational Consequences of Classification

7 0.78735673 374 hunch net-2009-10-10-ALT 2009

8 0.74210232 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.67985022 129 hunch net-2005-11-07-Prediction Competitions

10 0.51010346 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

11 0.47464883 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

12 0.47414798 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

13 0.45673493 299 hunch net-2008-04-27-Watchword: Supervised Learning

14 0.4527958 371 hunch net-2009-09-21-Netflix finishes (and starts)

15 0.44326064 103 hunch net-2005-08-18-SVM Adaptability

16 0.4334729 67 hunch net-2005-05-06-Don’t mix the solution into the problem

17 0.4194034 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

18 0.40010899 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

19 0.38114065 118 hunch net-2005-10-07-On-line learning of regular decision rules

20 0.36429808 258 hunch net-2007-08-12-Exponentiated Gradient


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.05), (3, 0.055), (24, 0.011), (27, 0.66), (53, 0.034), (55, 0.034), (77, 0.019), (94, 0.023)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99922115 245 hunch net-2007-05-12-Loss Function Semantics


2 0.99186206 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

Introduction: Here are two papers that seem particularly interesting at this year’s COLT. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X, is extraordinarily challenging …

3 0.99110299 274 hunch net-2007-11-28-Computational Consequences of Classification


4 0.99045122 308 hunch net-2008-07-06-To Dual or Not

Introduction: Yoram and Shai’s online learning tutorial at ICML brings up a question for me: “Why use the dual?” The basic setting is learning a weight vector $w_i$ so that the function $f(x) = \sum_i w_i x_i$ optimizes some convex loss function. The functional view of the dual is that instead of (or in addition to) keeping track of $w_i$ over the feature space, you keep track of a vector $a_j$ over the examples and define $w_i = \sum_j a_j x_{ji}$. The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful. Noise: when doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful …

5 0.99042988 400 hunch net-2010-06-13-The Good News on Exploration and Learning

Introduction: Consider the contextual bandit setting where, repeatedly: A context x is observed. An action a is taken given the context x. A reward r is observed, dependent on x and a. Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward. This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions. A decade ago the best machine learning techniques for this setting were implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and butter …

6 0.99002814 166 hunch net-2006-03-24-NLPers

7 0.99002814 246 hunch net-2007-06-13-Not Posting

8 0.99002814 418 hunch net-2010-12-02-Traffic Prediction Problem

9 0.98959434 288 hunch net-2008-02-10-Complexity Illness

10 0.98958534 172 hunch net-2006-04-14-JMLR is a success

11 0.98878217 45 hunch net-2005-03-22-Active learning

12 0.98636389 9 hunch net-2005-02-01-Watchword: Loss

13 0.98456502 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

14 0.97516733 352 hunch net-2009-05-06-Machine Learning to AI

15 0.97049612 304 hunch net-2008-06-27-Reviewing Horror Stories

16 0.96401876 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

17 0.95265198 244 hunch net-2007-05-09-The Missing Bound

18 0.95264119 483 hunch net-2013-06-10-The Large Scale Learning class notes

19 0.95111877 133 hunch net-2005-11-28-A question of quantification

20 0.95056474 67 hunch net-2005-05-06-Don’t mix the solution into the problem