
9 hunch net-2005-02-01-Watchword: Loss


meta info for this blog

Source: html

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l 2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting).
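The losses named under the second view can be written down directly. The following is a minimal sketch; the function names and the margin parameterization m = y·f(x), with labels y in {-1, +1}, are my own conventions rather than anything from the post:

```python
import math

# Losses as functions of the margin m = y * f(x),
# with labels y in {-1, +1} and a real-valued prediction f(x).

def zero_one_loss(m):
    """The loss 'determined by the problem' for binary classification."""
    return 0.0 if m > 0 else 1.0

def hinge_loss(m):
    """Convex surrogate used by SVMs."""
    return max(0.0, 1.0 - m)

def log_loss(m):
    """Logistic/log loss, common in Bayesian learning."""
    return math.log(1.0 + math.exp(-m))

def exponential_loss(m):
    """Exponential loss; one (incomplete) explanation of boosting."""
    return math.exp(-m)
```

All three surrogates upper-bound the 0-1 loss at every margin, which is one reason they are convenient stand-ins for it.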


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. [sent-1, score-2.006]

2 (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”. [sent-2, score-0.389]

3 ) Typically we try to find predictors that minimize loss. [sent-3, score-0.125]

4 There seems to be a strong dichotomy between two views of what “loss” means in learning. [sent-4, score-0.174]

5 Loss is a part of the specification of the learning problem. [sent-6, score-0.082]

6 Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l 2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. [sent-7, score-2.002]

7 To solve a problem, you optimize some particular loss function not given by the problem. [sent-9, score-1.046]

8 Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting). [sent-10, score-0.997]

9 One advantage of this viewpoint is that an appropriate choice of loss function (such as any of the above) results in a (relatively tractable) convex optimization problem. [sent-11, score-1.105]

10 It seems (to some extent) like looking where the light is rather than where your keys fell on the ground. [sent-13, score-0.281]

11 Many of these losses-of-convenience also seem to have behavior unlike real world problems. [sent-14, score-0.208]

12 For example in this contest somebody would have been the winner except they happened to predict one example incorrectly with very low probability. [sent-15, score-0.589]

13 This does not seem to correspond to the intuitive notion of what the loss should be on the problem. [sent-17, score-0.954]
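Sentence 9's claim, that surrogate losses like the hinge loss give convex (hence relatively tractable) optimization problems while 0-1 loss does not, can be spot-checked with a midpoint test. This is an illustrative sketch; the chosen evaluation points are arbitrary:

```python
def hinge(m):
    # Hinge loss as a function of the margin m = y * f(x).
    return max(0.0, 1.0 - m)

def zero_one(m):
    # 0-1 loss: wrong whenever the margin is not positive.
    return 0.0 if m > 0 else 1.0

# Midpoint test of convexity: f((a + b) / 2) <= (f(a) + f(b)) / 2.
a, b = -2.0, 2.0
mid = (a + b) / 2.0

hinge_convex_here = hinge(mid) <= (hinge(a) + hinge(b)) / 2.0          # True
zero_one_convex_here = zero_one(mid) <= (zero_one(a) + zero_one(b)) / 2.0  # False
```

The step discontinuity in 0-1 loss violates the midpoint inequality at the decision boundary, which is exactly what makes it hard to optimize directly.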


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('loss', 0.718), ('function', 0.203), ('determined', 0.185), ('classification', 0.13), ('optimize', 0.125), ('view', 0.113), ('fell', 0.106), ('incorrectly', 0.106), ('somebody', 0.106), ('functions', 0.104), ('determines', 0.098), ('keys', 0.098), ('recall', 0.098), ('views', 0.098), ('log', 0.097), ('explanation', 0.093), ('hinge', 0.088), ('became', 0.085), ('curve', 0.085), ('intuitive', 0.085), ('roc', 0.085), ('correspond', 0.082), ('incomplete', 0.082), ('specification', 0.082), ('example', 0.081), ('precision', 0.079), ('light', 0.077), ('svms', 0.077), ('means', 0.076), ('specified', 0.075), ('behavior', 0.073), ('contest', 0.073), ('winner', 0.072), ('happened', 0.07), ('seem', 0.069), ('convex', 0.067), ('minimize', 0.066), ('unlike', 0.066), ('prediction', 0.066), ('exponential', 0.065), ('extent', 0.064), ('boosting', 0.063), ('weighted', 0.062), ('examples', 0.061), ('tractable', 0.061), ('appropriate', 0.059), ('importance', 0.059), ('multiclass', 0.059), ('predictors', 0.059), ('viewpoint', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 9 hunch net-2005-02-01-Watchword: Loss


2 0.55836606 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f w (x)= sum i w i x i where optimizing squared loss (y-f w (x)) 2 over many samples is much more tractable than optimizing 0-1 loss I(y = Threshold(f w (x) – 0.5)) . While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1] , we can treat it as a predicted probability that the label is 1 , and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
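The proxy-loss setup quoted above can be sketched directly from its formulas. Reading Threshold(f_w(x) – 0.5) as "predict 1 when f_w(x) > 0.5" and charging 1 for a mismatch is an assumption on my part; the quoted indicator is ambiguous as written:

```python
def f(w, x):
    # Linear predictor f_w(x) = sum_i w_i * x_i from the quoted post.
    return sum(wi * xi for wi, xi in zip(w, x))

def squared_loss(y, w, x):
    # The computationally tractable proxy: (y - f_w(x))^2.
    return (y - f(w, x)) ** 2

def zero_one_loss(y, w, x):
    # The loss imposed by the world, under the thresholding
    # convention assumed above (mismatch costs 1).
    prediction = 1 if f(w, x) > 0.5 else 0
    return 0 if prediction == y else 1
```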

3 0.48225033 245 hunch net-2007-05-12-Loss Function Semantics

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l sq (y,y’)=(y-y’) 2 means predicting the (conditional) mean of y . Optimizing absolute value loss l av (y,y’)=|y-y’| means predicting the (conditional) median of y . Variants can handle other quantiles . 0/1 loss for classification is a special case. Optimizing log loss l log (y,y’)=log (1/Pr z~y’ (z=y)) means minimizing the description length of y . The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y , if y’ = arg min y’ E y ~ D l sq (y,y’) then y’ = E y~D y Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y . There are 3 points to this post. Everyone doing general machine lear
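The semantics quoted above (squared loss means predicting the mean, absolute value loss means predicting the median) can be checked numerically with a brute-force search over constant predictions. This is an illustrative check, not an efficient method, and the grid and sample values are invented:

```python
def best_constant(ys, loss):
    # Brute-force the constant prediction minimizing average loss
    # over a fine grid on [0, 10].
    candidates = [i / 100 for i in range(0, 1001)]
    return min(candidates, key=lambda c: sum(loss(y, c) for y in ys))

ys = [1.0, 2.0, 6.0]

sq_min = best_constant(ys, lambda y, c: (y - c) ** 2)  # minimized at mean(ys) = 3.0
av_min = best_constant(ys, lambda y, c: abs(y - c))    # minimized at median(ys) = 2.0
```

Note the two minimizers differ on the same data: the outlier at 6.0 pulls the squared-loss optimum toward it but leaves the median untouched.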

4 0.45249856 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s
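The "standard Bayesian recipe" described above, predict the conditional probability of each possibility and then choose the possibility minimizing expected loss, can be sketched in a few lines. The asymmetric loss matrix below is an invented example, not from the post:

```python
def bayes_decision(probs, loss):
    # probs[y]: predicted p(y | x) for each outcome y.
    # loss[a][y]: loss of choosing action a when the outcome is y.
    # The Bayesian recipe: pick the action minimizing expected loss.
    n_actions = len(loss)
    expected = [sum(p * loss[a][y] for y, p in enumerate(probs))
                for a in range(n_actions)]
    return min(range(n_actions), key=lambda a: expected[a])

# Invented asymmetric loss: missing a positive (y=1) costs 5,
# a false alarm costs 1.
loss = [[0, 5],
        [1, 0]]
action = bayes_decision([0.7, 0.3], loss)  # expected losses 1.5 vs 0.7 -> action 1
```

With this loss matrix, even a 30% chance of the positive outcome is enough to act on it, which is the point of inserting the loss at decision time rather than training time.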

5 0.35988873 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate , I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

6 0.331956 259 hunch net-2007-08-19-Choice of Metrics

7 0.31913385 129 hunch net-2005-11-07-Prediction Competitions

8 0.25236672 374 hunch net-2009-10-10-ALT 2009

9 0.23505117 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

10 0.22160807 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

11 0.18689083 371 hunch net-2009-09-21-Netflix finishes (and starts)

12 0.18129936 103 hunch net-2005-08-18-SVM Adaptability

13 0.17950208 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

14 0.17012346 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

15 0.16964935 235 hunch net-2007-03-03-All Models of Learning have Flaws

16 0.1691391 67 hunch net-2005-05-06-Don’t mix the solution into the problem

17 0.1516611 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

18 0.14832182 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

19 0.14409409 299 hunch net-2008-04-27-Watchword: Supervised Learning

20 0.13986199 165 hunch net-2006-03-23-The Approximation Argument


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.212), (1, 0.253), (2, 0.193), (3, -0.26), (4, -0.486), (5, 0.232), (6, -0.225), (7, 0.03), (8, 0.084), (9, 0.015), (10, 0.086), (11, -0.091), (12, -0.067), (13, 0.074), (14, -0.041), (15, 0.037), (16, 0.035), (17, -0.014), (18, -0.046), (19, 0.03), (20, 0.065), (21, -0.015), (22, -0.071), (23, -0.02), (24, -0.009), (25, 0.015), (26, -0.005), (27, 0.021), (28, 0.017), (29, 0.011), (30, 0.001), (31, -0.018), (32, 0.026), (33, -0.002), (34, 0.016), (35, 0.004), (36, 0.022), (37, -0.041), (38, -0.013), (39, 0.013), (40, 0.015), (41, 0.03), (42, 0.017), (43, 0.008), (44, 0.019), (45, 0.002), (46, 0.039), (47, 0.004), (48, 0.011), (49, -0.006)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99668443 9 hunch net-2005-02-01-Watchword: Loss


2 0.97984147 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


3 0.97761291 245 hunch net-2007-05-12-Loss Function Semantics


4 0.94173092 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


5 0.8811062 259 hunch net-2007-08-19-Choice of Metrics

Introduction: How do we judge success in Machine Learning? As Aaron notes , the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real wor

6 0.85578531 274 hunch net-2007-11-28-Computational Consequences of Classification

7 0.79383641 374 hunch net-2009-10-10-ALT 2009

8 0.76992798 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.74490643 129 hunch net-2005-11-07-Prediction Competitions

10 0.52734822 103 hunch net-2005-08-18-SVM Adaptability

11 0.52449828 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

12 0.50567949 371 hunch net-2009-09-21-Netflix finishes (and starts)

13 0.50225949 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

14 0.47542268 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

15 0.46529454 67 hunch net-2005-05-06-Don’t mix the solution into the problem

16 0.43553492 299 hunch net-2008-04-27-Watchword: Supervised Learning

17 0.40775526 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

18 0.40564492 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

19 0.38020071 420 hunch net-2010-12-26-NIPS 2010

20 0.37742102 165 hunch net-2006-03-23-The Approximation Argument


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.049), (15, 0.218), (27, 0.418), (53, 0.082), (55, 0.041), (64, 0.025), (94, 0.024), (95, 0.026)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.92476231 312 hunch net-2008-08-04-Electoralmarkets.com

Introduction: Lance reminded me about electoralmarkets today, which is cool enough that I want to point it out explicitly here. Most people still use polls to predict who wins, while electoralmarkets uses people betting real money. They might use polling information, but any other sources of information are implicitly also allowed. A side-by-side comparison of how polls compare to prediction markets might be fun in a few months.

same-blog 2 0.92173064 9 hunch net-2005-02-01-Watchword: Loss


3 0.8779518 245 hunch net-2007-05-12-Loss Function Semantics


4 0.87491834 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


5 0.87423766 45 hunch net-2005-03-22-Active learning

Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h w }: h w (x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (
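The quoted example, thresholding classifiers on the real line where roughly log(1/e) labels suffice, can be sketched as a binary search for the label boundary. This is a sketch under the realizable assumption stated in the post (labels are 0 below the hidden threshold and 1 above it); the function name is mine:

```python
def find_boundary(points, label):
    # points: sorted 1-D inputs; label(x): queries the (hidden) 0/1 label.
    # Binary search locates the first point labeled 1 using O(log n)
    # label queries, versus n queries for plain supervised labeling.
    lo, hi = 0, len(points)
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(points[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    return lo, queries
```

On 100 sorted points this uses at most 7 label queries, which is the exponential saving the post alludes to.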

6 0.8734743 483 hunch net-2013-06-10-The Large Scale Learning class notes

7 0.87281877 308 hunch net-2008-07-06-To Dual or Not

8 0.8722949 172 hunch net-2006-04-14-JMLR is a success

9 0.87183434 352 hunch net-2009-05-06-Machine Learning to AI

10 0.87155443 400 hunch net-2010-06-13-The Good News on Exploration and Learning

11 0.8715086 288 hunch net-2008-02-10-Complexity Illness

12 0.87024319 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

13 0.86901176 304 hunch net-2008-06-27-Reviewing Horror Stories

14 0.86883616 67 hunch net-2005-05-06-Don’t mix the solution into the problem

15 0.8647905 274 hunch net-2007-11-28-Computational Consequences of Classification

16 0.86221409 166 hunch net-2006-03-24-NLPers

17 0.86221409 246 hunch net-2007-06-13-Not Posting

18 0.86221409 418 hunch net-2010-12-02-Traffic Prediction Problem

19 0.86176705 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

20 0.85477084 41 hunch net-2005-03-15-The State of Tight Bounds