hunch_net-2007-259 knowledge-graph by maker-knowledge-mining

259 hunch net-2007-08-19-Choice of Metrics


meta info for this blog

Source: html

Introduction: How do we judge success in Machine Learning? As Aaron notes, the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real world or a loss which is optimized by a learning algorithm. …


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 As Aaron notes, the best way is to use the loss imposed on you by the world. [sent-2, score-0.554]

2 The prediction is used by some other system which expects some semantics to the predicted value. [sent-5, score-0.365]

3 The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). [sent-7, score-0.494]

4 In these situations, it’s unclear what metric for evaluation should be chosen. [sent-8, score-0.573]

5 I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real world or a loss which is optimized by a learning algorithm. [sent-10, score-1.08]

6 This property trumps all other concerns, and every argument that metric A better reflects the real problem than metric B must be carefully evaluated. [sent-13, score-1.205]

7 It’s not generally fair to exclude a method because it errs once, because losses imposed by the real world are often bounded. [sent-22, score-0.415]

8 This is a failure in the choice of metric of evaluation as much as a failure of the prediction system. [sent-23, score-0.756]

9 This means, for example, that we can construct examples of systems where a test set of size m typically produces a loss ordering of system A > system B but a test set of size 2m reverses the typical ordering. [sent-27, score-1.367]

10 Another way to suffer the effects of nonconvergence above is to have a metric which uses some formula on an entire set of examples to produce a prediction. [sent-31, score-0.673]

11 A simple example of this is “area under the ROC curve” which becomes very unstable when the set of test examples is “lopsided”—i.e. … [sent-32, score-0.406]

12 A reasonable approach is: compare the metric on the best constant predictor to the minimum of the metric. [sent-40, score-0.744]

13 To remove this “improvement” from consideration, normalizing by a bound on the loss appears reasonable. [sent-42, score-0.434]

14 This implies that we measure variation as (loss of best constant predictor – minimal possible loss) / (maximal loss – minimal loss). [sent-43, score-0.833]

15 As an example, squared loss—(y’ – y)^2—would have variation 0. [sent-44, score-0.387]

16 When possible, using the actual distribution over y to compute the loss of the best constant predictor is even better. [sent-49, score-0.617]

17 It’s useful to have a semantics to the metric, because it makes communication of the metric easy. [sent-51, score-0.672]

18 It’s useful to have the metric be simple because a good metric will be implemented multiple times by multiple people. [sent-53, score-1.01]

19 For example, if you care about probabilistic semantics, squared loss seems superior to log loss (i.e. [sent-56, score-0.931]

20 log(1/p(true y)) where p() is the predicted value), because squared loss is bounded. [sent-58, score-0.628]
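
Sentence 9’s reversal claim is easy to reproduce. Below is a minimal simulation with invented loss distributions (nothing here is from the post): system A suffers a rare, large loss while system B suffers a constant moderate loss, so a small test set usually ranks A ahead of B while a larger one typically reverses the ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system A: loss 100 with probability 0.01, else 0 (mean 1.0).
def mean_loss_A(m):
    return np.where(rng.random(m) < 0.01, 100.0, 0.0).mean()

LOSS_B = 0.5  # hypothetical system B: constant loss 0.5, the truly better system

for m in (20, 2000):
    wins = np.mean([mean_loss_A(m) < LOSS_B for _ in range(5000)])
    print(f"test size {m}: A looks better on {wins:.0%} of test sets")
# test size 20 -> ~82% (usually no large loss appears); test size 2000 -> ~0%
```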
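
Sentences 10–11 concern metrics computed from an entire test set at once. A quick simulation (the score distributions and counts are made up) shows how much a pairwise-comparison AUC estimate fluctuates when positives are scarce:

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(pos, neg):
    # fraction of (positive, negative) pairs whose scores are ordered correctly
    return (pos[:, None] > neg[None, :]).mean()

for n_pos in (3, 300):
    aucs = [auc(rng.normal(1, 1, n_pos), rng.normal(0, 1, 600))
            for _ in range(2000)]
    print(f"{n_pos} positives: mean AUC {np.mean(aucs):.3f}, std {np.std(aucs):.3f}")
# the lopsided test set (3 positives) yields a far noisier estimate
```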
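
Sentences 12–16 define a normalized “variation”: (loss of the best constant predictor – minimal possible loss) / (maximal loss – minimal loss). A sketch for squared loss on labels in [0,1], using an empirical sample of y in place of the actual distribution (the post prefers the actual distribution when it is known):

```python
import numpy as np

y = np.random.default_rng(2).integers(0, 2, size=100_000).astype(float)

# Under squared loss the best constant predictor is the mean of y,
# and its loss is the variance of y.
const_loss = ((y.mean() - y) ** 2).mean()
min_loss, max_loss = 0.0, 1.0  # bounds of (y' - y)^2 when y, y' lie in [0, 1]

variation = (const_loss - min_loss) / (max_loss - min_loss)
print(variation)  # ~0.25 for balanced binary labels
```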
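
Sentences 19–20 turn on boundedness. A tiny check (p is the predicted probability of the true label) shows one confident mistake blowing up log loss while squared loss never exceeds 1:

```python
import math

for p in (0.5, 0.01, 1e-6):     # predicted probability of the true label
    log_loss = math.log(1 / p)   # unbounded as p -> 0
    sq_loss = (1 - p) ** 2       # bounded by 1
    print(f"p={p:g}: log loss {log_loss:.2f}, squared loss {sq_loss:.4f}")
```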


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('metric', 0.505), ('loss', 0.378), ('variation', 0.232), ('auc', 0.167), ('semantics', 0.167), ('test', 0.164), ('vary', 0.122), ('ordering', 0.122), ('losses', 0.113), ('imposed', 0.113), ('conditioning', 0.112), ('set', 0.103), ('squared', 0.099), ('mixed', 0.098), ('constant', 0.096), ('simplicity', 0.086), ('evaluating', 0.083), ('pairs', 0.081), ('size', 0.08), ('predictor', 0.08), ('minimal', 0.079), ('log', 0.076), ('predicted', 0.075), ('example', 0.074), ('substantially', 0.069), ('prediction', 0.069), ('evaluation', 0.068), ('fair', 0.068), ('advice', 0.068), ('property', 0.067), ('real', 0.065), ('examples', 0.065), ('best', 0.063), ('argument', 0.063), ('predictive', 0.063), ('always', 0.059), ('except', 0.058), ('measure', 0.058), ('correct', 0.058), ('failure', 0.057), ('reverse', 0.056), ('unknowable', 0.056), ('saved', 0.056), ('boundedness', 0.056), ('disaster', 0.056), ('exclude', 0.056), ('murkier', 0.056), ('normalizing', 0.056), ('would', 0.056), ('system', 0.054)]
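
How the similarity numbers on this page arise is not documented here; a common recipe, shown purely as an assumption, is cosine similarity between tfidf-weighted bags of words:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity of two word lists under given idf weights (a sketch)."""
    va = {w: c * idf.get(w, 0.0) for w, c in Counter(doc_a).items()}
    vb = {w: c * idf.get(w, 0.0) for w, c in Counter(doc_b).items()}
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    na, nb = norm(va), norm(vb)
    return dot / (na * nb) if na and nb else 0.0

# hypothetical idf weights, for illustration only
idf = {"metric": 2.0, "loss": 1.5, "the": 0.01}
print(tfidf_cosine("the metric of loss".split(), "loss metric".split(), idf))
```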

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 259 hunch net-2007-08-19-Choice of Metrics


2 0.34549713 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y – f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y != Threshold(f_w(x) – 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p’(y|x)) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
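
The proxy-loss idea above is easy to demonstrate end to end. This sketch (synthetic data and plain gradient descent, not anything from the post) minimizes the squared-loss proxy and then reports the 0-1 loss that actually matters:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)   # synthetic 0/1 labels

w = np.zeros(5)
for _ in range(300):
    w -= 0.1 * 2 * X.T @ (X @ w - y) / len(y)    # gradient step on squared loss

errors = ((X @ w - 0.5 > 0) != (y == 1)).mean()  # 0-1 loss via Threshold(f_w(x) - 0.5)
print(f"0-1 loss after optimizing the proxy: {errors:.3f}")
```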

3 0.331956 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l_2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

4 0.31551817 245 hunch net-2007-05-12-Loss Function Semantics

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l_sq(y,y’) = (y–y’)^2 means predicting the (conditional) mean of y. Optimizing absolute value loss l_av(y,y’) = |y–y’| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case. Optimizing log loss l_log(y,y’) = log(1/Pr_{z~y’}(z=y)) means minimizing the description length of y. The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y, if y’ = arg min_{y’} E_{y~D} l_sq(y,y’) then y’ = E_{y~D} y. Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y. There are 3 points to this post. Everyone doing general machine lear
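
The stated semantics can be verified numerically. This sketch (an arbitrary skewed sample, grid search over constants) confirms that the squared-loss-optimal constant is the mean while the absolute-loss-optimal constant is the median:

```python
import numpy as np

y = np.random.default_rng(4).exponential(size=2001)  # a skewed sample
c = np.linspace(y.min(), y.max(), 1001)[:, None]     # candidate constant predictions

best_sq = c[((c - y) ** 2).mean(axis=1).argmin(), 0]
best_av = c[np.abs(c - y).mean(axis=1).argmin(), 0]

print(best_sq, y.mean())       # squared loss -> (conditional) mean
print(best_av, np.median(y))   # absolute value loss -> (conditional) median
```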

5 0.23878857 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s
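
The Bayesian recipe described above (predict conditional probabilities, insert the loss only at decision time) fits in a few lines. The loss matrix here is invented for illustration:

```python
import numpy as np

# Hypothetical loss matrix: rows are actions, columns are true labels.
loss = np.array([[0.0, 5.0],   # action 0 is expensive when the true label is 1
                 [1.0, 0.0]])  # action 1 is a cheap false alarm when it is 0

def bayes_action(p1):
    """Choose the action minimizing expected loss, given P(y=1|x) = p1."""
    return int((loss @ np.array([1.0 - p1, p1])).argmin())

print(bayes_action(0.1))  # 0: below the 1/6 break-even probability
print(bayes_action(0.3))  # 1: acts even though p1 < 0.5, because misses cost 5x
```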

6 0.22800708 274 hunch net-2007-11-28-Computational Consequences of Classification

7 0.21778515 129 hunch net-2005-11-07-Prediction Competitions

8 0.20669644 371 hunch net-2009-09-21-Netflix finishes (and starts)

9 0.17580363 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

10 0.1545051 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

11 0.15340567 374 hunch net-2009-10-10-ALT 2009

12 0.14751108 6 hunch net-2005-01-27-Learning Complete Problems

13 0.13388649 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

14 0.13246365 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

15 0.13041152 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

16 0.12731585 67 hunch net-2005-05-06-Don’t mix the solution into the problem

17 0.12527286 235 hunch net-2007-03-03-All Models of Learning have Flaws

18 0.1252507 165 hunch net-2006-03-23-The Approximation Argument

19 0.12396877 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

20 0.11576459 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.252), (1, 0.2), (2, 0.139), (3, -0.154), (4, -0.296), (5, 0.15), (6, -0.145), (7, 0.043), (8, 0.066), (9, -0.017), (10, 0.034), (11, 0.068), (12, 0.01), (13, 0.018), (14, -0.021), (15, -0.015), (16, -0.029), (17, 0.022), (18, -0.054), (19, 0.067), (20, -0.004), (21, 0.011), (22, -0.061), (23, 0.025), (24, -0.007), (25, 0.015), (26, 0.011), (27, -0.066), (28, -0.023), (29, -0.049), (30, -0.006), (31, -0.027), (32, 0.073), (33, 0.034), (34, 0.034), (35, 0.031), (36, 0.013), (37, 0.014), (38, -0.045), (39, -0.01), (40, 0.039), (41, -0.029), (42, 0.019), (43, -0.005), (44, -0.004), (45, 0.022), (46, 0.021), (47, -0.01), (48, 0.039), (49, -0.015)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99106324 259 hunch net-2007-08-19-Choice of Metrics


2 0.90086901 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


3 0.88611042 9 hunch net-2005-02-01-Watchword: Loss


4 0.87683707 245 hunch net-2007-05-12-Loss Function Semantics


5 0.84747434 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints
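
The log-versus-linear label complexity mentioned above has a textbook illustration: learning a 1-d threshold classifier by binary search, where every queried label halves the interval containing the threshold. The threshold and target error below are arbitrary:

```python
# Active learning of a 1-d threshold classifier by binary search.
true_thresh = 0.637                  # unknown to the learner
lo, hi, queries = 0.0, 1.0, 0
while hi - lo > 0.01:                # target error e = 0.01
    mid = (lo + hi) / 2
    label = mid >= true_thresh       # query the label of the point at mid
    lo, hi = (lo, mid) if label else (mid, hi)
    queries += 1
print(queries, (lo + hi) / 2)        # ~7 = log2(1/e) labels rather than ~1/e
```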

6 0.8472141 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

7 0.80150318 129 hunch net-2005-11-07-Prediction Competitions

8 0.78239679 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.7086432 374 hunch net-2009-10-10-ALT 2009

10 0.67098802 371 hunch net-2009-09-21-Netflix finishes (and starts)

11 0.58716673 67 hunch net-2005-05-06-Don’t mix the solution into the problem

12 0.5268237 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

13 0.52144337 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

14 0.52116328 430 hunch net-2011-04-11-The Heritage Health Prize

15 0.50812727 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

16 0.47366756 177 hunch net-2006-05-05-An ICML reject

17 0.46043146 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

18 0.45749754 62 hunch net-2005-04-26-To calibrate or not?

19 0.45668101 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

20 0.45155495 118 hunch net-2005-10-07-On-line learning of regular decision rules


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.02), (3, 0.056), (9, 0.025), (12, 0.106), (27, 0.253), (38, 0.095), (48, 0.033), (51, 0.025), (53, 0.02), (55, 0.069), (56, 0.019), (77, 0.065), (79, 0.024), (94, 0.072), (95, 0.021)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95656031 259 hunch net-2007-08-19-Choice of Metrics


2 0.92281824 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

Introduction: There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require an exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly: The world reveals feature values (aka context information). A policy chooses an action. The world provides a reward. Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications. I’m g
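
The interaction loop just described is short enough to sketch. Below is a minimal epsilon-greedy contextual bandit with one linear reward estimate per action; the environment, reward, and constants are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n_actions, d = 3, 4
w = np.zeros((n_actions, d))           # one linear reward model per action
total = 0.0

for t in range(10_000):
    x = rng.normal(size=d)             # the world reveals context
    if rng.random() < 0.05:            # explore occasionally
        a = int(rng.integers(n_actions))
    else:                              # otherwise exploit current estimates
        a = int((w @ x).argmax())
    r = x[a] + rng.normal(0, 0.1)      # hypothetical reward: action a pays x[a]
    w[a] += 0.01 * (r - w[a] @ x) * x  # update only the chosen action's model
    total += r

print(total / 10_000)                  # well above the ~0 reward of random play
```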

3 0.8867268 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

4 0.88486171 289 hunch net-2008-02-17-The Meaning of Confidence

Introduction: In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused. Confidence = Probability . For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5 , but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain. Classical Confidence Intervals . These a
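
The classical confidence intervals mentioned above are straightforward for a test-set error rate. A normal-approximation sketch (z = 1.96 for roughly 95% coverage):

```python
import math

def binomial_ci(errors, n, z=1.96):
    """Normal-approximation confidence interval for an error rate errors/n."""
    p = errors / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

print(binomial_ci(30, 1000))  # ~3% +/- 1.1% error on 1000 test examples
```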

5 0.88382471 388 hunch net-2010-01-24-Specializations of the Master Problem

Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of

6 0.88293731 183 hunch net-2006-06-14-Explorations of Exploration

7 0.87714446 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

8 0.87684214 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

9 0.87672675 14 hunch net-2005-02-07-The State of the Reduction

10 0.8750245 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

11 0.87410289 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

12 0.87101161 351 hunch net-2009-05-02-Wielding a New Abstraction

13 0.87054586 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

14 0.86972892 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

15 0.86834615 160 hunch net-2006-03-02-Why do people count for learning?

16 0.86818331 131 hunch net-2005-11-16-The Everything Ensemble Edge

17 0.86815178 258 hunch net-2007-08-12-Exponentiated Gradient

18 0.86670345 133 hunch net-2005-11-28-A question of quantification

19 0.86580861 41 hunch net-2005-03-15-The State of Tight Bounds

20 0.865381 12 hunch net-2005-02-03-Learning Theory, by assumption