hunch_net hunch_net-2009 hunch_net-2009-341 knowledge-graph by maker-knowledge-mining

341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


meta info for this blog

Source: html

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = Σ_i w_i x_i, where optimizing squared loss (y − f_w(x))² over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) − 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p′(y|x)), where p′(y|x) is the predicted probability of the observed label. A standard method for confi
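The contrast in the intro can be sketched in a few lines of code. None of this appears in the post; the data, the weight vector, and the feature values are all hypothetical, chosen only to show the two losses side by side for a linear predictor.

```python
import numpy as np

# Hypothetical toy data: 4 examples, 2 features, labels in {0, 1}.
X = np.array([[1.0, 0.0], [0.8, 0.2], [0.2, 0.9], [0.0, 1.0]])
y = np.array([1, 1, 0, 0])
w = np.array([1.0, -0.5])          # an illustrative weight vector

f = X @ w                          # f_w(x) = sum_i w_i x_i

# Proxy loss: squared loss, smooth in w and easy to optimize.
squared_loss = np.mean((y - f) ** 2)

# World-imposed loss: 0-1 loss on the thresholded prediction,
# piecewise constant in w and hence hard to optimize directly.
predictions = (f - 0.5 > 0).astype(int)   # Threshold(f_w(x) - 0.5)
zero_one_loss = np.mean(predictions != y)

print(squared_loss, zero_one_loss)
```

Here every example is classified correctly (0-1 loss 0), yet the squared loss is positive; the proxy keeps supplying a gradient where the world-imposed loss is flat.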


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. [sent-1, score-1.722]

2 A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = Σ_i w_i x_i, where optimizing squared loss (y − f_w(x))² over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) − 0.5)). [sent-2, score-1.761]

3 While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? [sent-4, score-1.661]

4 If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p′(y|x)), where p′(y|x) is the predicted probability of the observed label. [sent-7, score-1.268]
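The log loss definition above is easy to make concrete. A minimal sketch (not from the post; the probabilities are made up) that also hints at the instability discussed later, since the loss diverges as the predicted probability of the observed label approaches 0:

```python
import math

def log_loss(p1, y):
    """Log loss for a predicted probability p1 = p'(y=1|x), with label y in {0,1}."""
    p_observed = p1 if y == 1 else 1.0 - p1
    return math.log(1.0 / p_observed)

print(log_loss(0.9, 1))        # confident and right: small loss
print(log_loss(0.9, 0))        # confident and wrong: log(1/0.1), large
print(log_loss(1e-9, 1))       # loss blows up as p'(y|x) -> 0
```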

5 The squared loss approach (discussed above) is also quite common. [sent-10, score-0.817]

6 The form of hinge loss is slightly unfamiliar, because the label is {0,1} rather than {−1,1}. [sent-16, score-1.225]

7 The optimal prediction for hinge loss is not the probability of y given x, but rather some value which is at least 1 if the most likely label is 1 and at most 0 if the most likely label is 0. [sent-17, score-1.725]

8 Hinge loss is not a proper scoring rule for mean, but since it does get the sign right, using it for classification is reasonable. [sent-19, score-0.878]
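The post does not spell out the exact {0,1}-label hinge form, so the function below is a reasonable reconstruction under that assumption: margins are placed at 0 and 1, so the loss is zero iff the prediction is ≥ 1 when y = 1 and ≤ 0 when y = 0. The sketch then checks the claim about the optimal prediction by minimizing the expected loss numerically:

```python
import numpy as np

def hinge_01(y, f):
    """Hinge loss written for labels in {0,1} (an assumed form, not quoted
    from the post): zero loss iff f >= 1 when y = 1, and f <= 0 when y = 0."""
    s = 2 * y - 1                  # map {0,1} -> {-1,+1}
    return np.maximum(0.0, 1.0 - s * (2 * f - 1))

def expected_hinge(p, f):
    """Expected hinge loss when P(y=1|x) = p, as a function of the prediction f."""
    return p * hinge_01(1, f) + (1 - p) * hinge_01(0, f)

p = 0.7                            # most likely label is 1
fs = np.linspace(-0.5, 1.5, 201)
best_f = fs[np.argmin(expected_hinge(p, fs))]
print(best_f)                      # 1.0, not the probability 0.7
```

The minimizer is 1, not 0.7: exactly the sense in which hinge loss gets the sign right while failing to be a proper scoring rule for the mean.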

9 For example see Yaroslav’s old post for an argument about the comparison of log loss and hinge loss and why hinge loss might be better. [sent-21, score-3.019]

10 Restated, there is no reason other than representational convenience that f_w(x) needs to take a value outside of the interval [0,1] for squared loss or hinge loss. [sent-24, score-1.412]

11 The implication is that optimization of log loss can be unstable in ways that optimization of these other losses is not. [sent-26, score-0.944]

12 This can be stated precisely by noting that sample complexity bounds (simple ones here) for 0-1 loss hold for f_w′(x) under squared or hinge loss, but the same theorem statement does not hold for log loss without additional assumptions. [sent-27, score-2.235]

13 For log loss and squared loss, any other threshold is inconsistent. [sent-32, score-1.095]

14 Since the optimal predictor for hinge loss always takes value 0 or 1, there is some freedom in how we convert, but a reasonable approach is to also threshold at 0.5. [sent-33, score-1.48]
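The conversion step is trivial but worth making explicit; a sketch of the shared thresholding rule (the helper name is my own, not the post's):

```python
def to_label(f, threshold=0.5):
    """Convert a real-valued prediction f_w'(x) into a {0,1} classification."""
    return 1 if f > threshold else 0

# For log loss and squared loss the prediction estimates P(y=1|x), so 0.5
# is the only consistent threshold; for hinge loss the optimal prediction
# is already 0 or 1, and thresholding at 0.5 still separates the two cases.
print(to_label(0.8), to_label(0.2), to_label(1.0), to_label(0.0))
```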

15 In other words, if an adversary picks the true conditional probability distribution p(y|x) and the prediction f_w′(x), how does the proxy loss of f_w′(x) bound the 0-1 loss? [sent-36, score-1.052]

16 For hinge loss, the regret is eps, and for squared loss the regret is eps². [sent-44, score-1.941]

17 Since we are only interested in regrets less than 1 , the square root is undesirable, and hinge loss is preferred, because a stronger convergence of squared loss is needed to achieve the same guarantee on 0-1 loss. [sent-47, score-1.978]
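The quantitative comparison reduces to simple arithmetic, sketched below under one reading of the bounds above: proxy regret eps yields 0-1 regret at most eps for hinge loss but only sqrt(eps) for squared loss (the function and its name are my framing, not the post's).

```python
import math

def zero_one_regret_bound(proxy_regret, proxy):
    """0-1 regret guaranteed by a given proxy regret, per the bounds above:
    eps for hinge loss, sqrt(eps) for squared loss."""
    if proxy == "hinge":
        return proxy_regret
    if proxy == "squared":
        return math.sqrt(proxy_regret)
    raise ValueError(proxy)

eps = 0.01
print(zero_one_regret_bound(eps, "hinge"))    # 0.01
print(zero_one_regret_bound(eps, "squared"))  # 0.1 -- much weaker when regrets < 1
```

For any regret below 1 the square root inflates the bound, which is exactly why hinge loss is preferred here.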

18 I don’t know any proxy loss which is quantitatively better, but generalizations exist. [sent-49, score-0.834]

19 The regret of hinge loss is the same as for absolute value loss |y − f_w′(x)|, since they are identical for {0,1} labels. [sent-50, score-2.086]

20 One advantage of absolute value loss is that it has a known and sometimes useful semantics for values between 0 and 1: the optimal prediction is the median. [sent-51, score-1.032]
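The median semantics (and the contrasting mean semantics of squared loss) can be verified numerically. A sketch on made-up data, using a skewed distribution so that mean and median visibly differ:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=10001)       # skewed sample, so mean != median

# Search over constant predictions c for the minimizer of each empirical risk.
grid = np.linspace(0.0, 5.0, 2001)
abs_risk = [np.mean(np.abs(y - c)) for c in grid]
sq_risk = [np.mean((y - c) ** 2) for c in grid]

best_abs = grid[np.argmin(abs_risk)]  # absolute loss: optimal constant ~ median
best_sq = grid[np.argmin(sq_risk)]    # squared loss: optimal constant ~ mean
print(best_abs, np.median(y))
print(best_sq, np.mean(y))
```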


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('loss', 0.637), ('hinge', 0.476), ('eps', 0.212), ('squared', 0.18), ('proxy', 0.159), ('log', 0.156), ('threshold', 0.122), ('label', 0.112), ('regret', 0.112), ('optimal', 0.092), ('probability', 0.082), ('value', 0.081), ('losses', 0.077), ('since', 0.076), ('predictor', 0.072), ('optimizing', 0.069), ('max', 0.067), ('absolute', 0.067), ('dot', 0.067), ('conditional', 0.065), ('scoring', 0.064), ('product', 0.061), ('prediction', 0.059), ('stability', 0.059), ('function', 0.058), ('proper', 0.057), ('semantics', 0.057), ('predicted', 0.051), ('imposed', 0.051), ('bounds', 0.051), ('adversary', 0.05), ('hold', 0.049), ('convergence', 0.048), ('optimize', 0.045), ('rule', 0.044), ('regression', 0.041), ('vector', 0.041), ('advantage', 0.039), ('shares', 0.038), ('convert', 0.038), ('dissimilar', 0.038), ('popularized', 0.038), ('qualitatively', 0.038), ('confine', 0.038), ('qualitative', 0.038), ('quantitatively', 0.038), ('undesirable', 0.038), ('take', 0.038), ('likely', 0.037), ('optimization', 0.037)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


2 0.57299441 245 hunch net-2007-05-12-Loss Function Semantics

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l_sq(y,y′) = (y−y′)² means predicting the (conditional) mean of y. Optimizing absolute value loss l_av(y,y′) = |y−y′| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case. Optimizing log loss l_log(y,y′) = log(1/Pr_{z~y′}(z=y)) means minimizing the description length of y. The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y, if y′ = arg min_{y′} E_{y~D} l_sq(y,y′), then y′ = E_{y~D} y. Similar theorems hold for the other examples above, and they can all be extended to predictors of y′ for distributions D over a context X and a value Y. There are 3 points to this post. Everyone doing general machine lear

3 0.55836606 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l 2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

4 0.41569108 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s

5 0.38712731 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e-error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

6 0.34549713 259 hunch net-2007-08-19-Choice of Metrics

7 0.23644122 374 hunch net-2009-10-10-ALT 2009

8 0.22807188 129 hunch net-2005-11-07-Prediction Competitions

9 0.2258358 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

10 0.22387797 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

11 0.20395921 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

12 0.19682983 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

13 0.17712553 371 hunch net-2009-09-21-Netflix finishes (and starts)

14 0.16335315 258 hunch net-2007-08-12-Exponentiated Gradient

15 0.14574024 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

16 0.14014785 103 hunch net-2005-08-18-SVM Adaptability

17 0.13975181 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

18 0.13075204 67 hunch net-2005-05-06-Don’t mix the solution into the problem

19 0.12747557 165 hunch net-2006-03-23-The Approximation Argument

20 0.12445451 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.199), (1, 0.27), (2, 0.212), (3, -0.282), (4, -0.485), (5, 0.219), (6, -0.192), (7, 0.006), (8, 0.047), (9, 0.03), (10, 0.114), (11, -0.075), (12, -0.054), (13, 0.034), (14, -0.065), (15, 0.021), (16, -0.037), (17, 0.014), (18, -0.028), (19, 0.055), (20, 0.073), (21, 0.033), (22, -0.052), (23, 0.033), (24, 0.018), (25, 0.013), (26, -0.001), (27, -0.002), (28, 0.009), (29, -0.017), (30, -0.004), (31, -0.001), (32, 0.033), (33, 0.025), (34, -0.003), (35, -0.023), (36, 0.036), (37, -0.017), (38, -0.03), (39, -0.029), (40, -0.021), (41, 0.016), (42, 0.011), (43, 0.01), (44, -0.017), (45, 0.016), (46, 0.032), (47, -0.012), (48, 0.025), (49, -0.013)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99774474 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


2 0.97767901 245 hunch net-2007-05-12-Loss Function Semantics


3 0.97014928 9 hunch net-2005-02-01-Watchword: Loss


4 0.90488249 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


5 0.89161772 274 hunch net-2007-11-28-Computational Consequences of Classification


6 0.88330805 259 hunch net-2007-08-19-Choice of Metrics

7 0.78556627 374 hunch net-2009-10-10-ALT 2009

8 0.77037561 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.70372093 129 hunch net-2005-11-07-Prediction Competitions

10 0.54285777 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

11 0.49462223 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

12 0.47600922 371 hunch net-2009-09-21-Netflix finishes (and starts)

13 0.46752843 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

14 0.46421248 67 hunch net-2005-05-06-Don’t mix the solution into the problem

15 0.4570598 299 hunch net-2008-04-27-Watchword: Supervised Learning

16 0.43576223 103 hunch net-2005-08-18-SVM Adaptability

17 0.42087752 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

18 0.40443489 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

19 0.3876673 118 hunch net-2005-10-07-On-line learning of regular decision rules

20 0.3709856 258 hunch net-2007-08-12-Exponentiated Gradient


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.014), (3, 0.085), (9, 0.046), (24, 0.212), (27, 0.369), (38, 0.02), (53, 0.01), (55, 0.043), (77, 0.037), (94, 0.036), (95, 0.014)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.91572887 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


2 0.85323733 245 hunch net-2007-05-12-Loss Function Semantics


3 0.84850359 45 hunch net-2005-03-22-Active learning

Introduction: Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples (just walk around with a microphone) but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework. Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h_w}: h_w(x) = 1 if x > w, and 0 otherwise. VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (

4 0.84584522 183 hunch net-2006-06-14-Explorations of Exploration

Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying: there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s′ is given by a conditional distribution P(s′|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E³ by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E³ and

5 0.84539038 9 hunch net-2005-02-01-Watchword: Loss


6 0.84449661 12 hunch net-2005-02-03-Learning Theory, by assumption

7 0.84320349 400 hunch net-2010-06-13-The Good News on Exploration and Learning

8 0.84302133 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

9 0.84273529 289 hunch net-2008-02-17-The Meaning of Confidence

10 0.84136659 304 hunch net-2008-06-27-Reviewing Horror Stories

11 0.84123087 172 hunch net-2006-04-14-JMLR is a success

12 0.84050888 288 hunch net-2008-02-10-Complexity Illness

13 0.83999413 352 hunch net-2009-05-06-Machine Learning to AI

14 0.83885175 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

15 0.83867681 308 hunch net-2008-07-06-To Dual or Not

16 0.83744574 274 hunch net-2007-11-28-Computational Consequences of Classification

17 0.83493197 244 hunch net-2007-05-09-The Missing Bound

18 0.83349943 166 hunch net-2006-03-24-NLPers

19 0.83349943 246 hunch net-2007-06-13-Not Posting

20 0.83349943 418 hunch net-2010-12-02-Traffic Prediction Problem