hunch_net hunch_net-2007 hunch_net-2007-274 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor could not.
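To make the active learning claim concrete, here is a minimal sketch of my own (not from the post) under strong extra assumptions: a one-dimensional threshold classifier, a noiseless labeler, and the ability to request the label of any point. Binary search then produces a classifier with error at most e under a uniform marginal using about log2(1/e) label queries, rather than the roughly 1/e labels that passive sampling would need.

def active_threshold(query, eps):
    """Binary-search [0,1] for the boundary of a noiseless threshold labeler."""
    lo, hi = 0.0, 1.0
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if query(mid):            # label 1 => the true threshold is at or below mid
            hi = mid
        else:                     # label 0 => the true threshold is above mid
            lo = mid
    return (lo + hi) / 2, queries

t_star = 0.37                     # hidden true threshold (hypothetical value)
eps = 1e-3
t_hat, used = active_threshold(lambda x: x >= t_star, eps)
print(f"estimated threshold {t_hat:.4f} after {used} label queries (error <= {eps})")

With eps = 0.001 this uses 10 label queries; labeling random points until two of them bracket the threshold that tightly would take on the order of 1000 labels.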
sentIndex sentText sentNum sentScore
1 In the regression vs classification debate, I’m adding a new “pro” to classification. [sent-1, score-0.596]
2 It seems there are computational shortcuts available for classification which simply aren’t available for regression. [sent-2, score-0.61]
3 In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. [sent-4, score-0.499]
4 Only much more modest improvements appear to be achievable for squared loss regression. [sent-5, score-1.091]
5 The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. [sent-6, score-1.513]
6 In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. [sent-7, score-1.824]
7 In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor could not. [sent-8, score-2.15]
8 For example, if there is one feature which determines whether a binary label has probability less than or greater than 0.5. [sent-9, score-0.666]
9 Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss. [sent-11, score-1.672]
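Sentences 8 and 9 invite a small simulation. The sketch below is my own construction (the distribution, feature counts, and constants are made up, not from the post): a single binary feature x0 decides whether Pr(y=1|x) is above or below 0.5, while several other features only nudge the exact probability. A classifier restricted to x0 already attains the Bayes error rate, but a squared loss predictor restricted to x0 is strictly worse than one that sees every feature.

import numpy as np

rng = np.random.default_rng(0)
n, k = 200_000, 8                          # samples and number of "extra" features

x0 = rng.integers(0, 2, n)                 # the single decisive feature
extra = rng.integers(0, 2, (n, k))         # features that only nudge the probability

# Pr(y=1|x) is 0.3 or 0.7 depending on x0, plus a nudge of at most +/-0.16,
# so whether the probability exceeds 0.5 is determined by x0 alone.
p = np.where(x0 == 1, 0.7, 0.3) + 0.04 * (extra - 0.5).sum(axis=1)
y = rng.random(n) < p

# 0-1 loss: the rule "predict 1 iff x0 == 1" coincides with the Bayes-optimal rule here.
err_x0 = np.mean((x0 == 1) != y)
err_bayes = np.mean((p > 0.5) != y)        # identical by construction

# Squared loss: the best x0-only predictor is E[y | x0]; the best overall predictor is p.
pred_x0 = np.where(x0 == 1, y[x0 == 1].mean(), y[x0 == 0].mean())
sq_x0 = np.mean((pred_x0 - y) ** 2)
sq_all = np.mean((p - y) ** 2)

print(f"0-1 error using x0 only: {err_x0:.3f}  (Bayes error: {err_bayes:.3f})")
print(f"squared loss using x0 only: {sq_x0:.4f}  vs all features: {sq_all:.4f}")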
wordName wordTfidf (topN-words)
[('squared', 0.424), ('loss', 0.4), ('classifier', 0.264), ('variations', 0.239), ('constraint', 0.198), ('parameter', 0.182), ('regression', 0.155), ('flat', 0.143), ('determines', 0.133), ('shortcuts', 0.133), ('learned', 0.128), ('vs', 0.125), ('available', 0.125), ('probability', 0.124), ('feature', 0.123), ('achievable', 0.12), ('classification', 0.117), ('regressor', 0.115), ('influence', 0.115), ('respect', 0.112), ('computational', 0.11), ('debate', 0.104), ('spaces', 0.104), ('exact', 0.097), ('arises', 0.097), ('adding', 0.095), ('accuracy', 0.093), ('sometimes', 0.09), ('sensitive', 0.089), ('labeled', 0.085), ('greater', 0.082), ('classifiers', 0.081), ('constraints', 0.081), ('contrast', 0.081), ('improvements', 0.08), ('binary', 0.073), ('label', 0.07), ('exists', 0.067), ('appear', 0.067), ('aren', 0.066), ('cost', 0.065), ('log', 0.065), ('required', 0.065), ('essential', 0.065), ('space', 0.064), ('features', 0.061), ('whether', 0.061), ('active', 0.06), ('either', 0.059), ('reason', 0.059)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 274 hunch net-2007-11-28-Computational Consequences of Classification
2 0.43613967 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
Introduction: For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered. Binary classification is learning a classifier c:X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y~D}(c(x) ≠ y). The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s:X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2. It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction a squared error regret of r
3 0.38712731 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) – 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
4 0.35988873 9 hunch net-2005-02-01-Watchword: Loss
Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l_2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV
5 0.3286587 245 hunch net-2007-05-12-Loss Function Semantics
Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l_sq(y,y’) = (y-y’)^2 means predicting the (conditional) mean of y. Optimizing absolute value loss l_av(y,y’) = |y-y’| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case. Optimizing log loss l_log(y,y’) = log(1/Pr_{z~y’}(z=y)) means minimizing the description length of y. The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y, if y’ = arg min_{y’} E_{y~D} l_sq(y,y’) then y’ = E_{y~D} y. (A quick numerical check of the mean and median claims appears after this list.) Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y. There are 3 points to this post. Everyone doing general machine lear
6 0.25213522 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
7 0.22800708 259 hunch net-2007-08-19-Choice of Metrics
8 0.17838493 374 hunch net-2009-10-10-ALT 2009
9 0.17219251 371 hunch net-2009-09-21-Netflix finishes (and starts)
10 0.15081684 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?
11 0.14796935 14 hunch net-2005-02-07-The State of the Reduction
12 0.14480372 258 hunch net-2007-08-12-Exponentiated Gradient
13 0.14372456 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
14 0.13608508 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
15 0.13276488 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
16 0.1298968 43 hunch net-2005-03-18-Binomial Weighting
17 0.12888232 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
18 0.12810461 183 hunch net-2006-06-14-Explorations of Exploration
19 0.12712574 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
20 0.12473522 67 hunch net-2005-05-06-Don’t mix the solution into the problem
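As an aside on entry 5 above (Loss Function Semantics), the quoted claims that squared loss is minimized by the mean and absolute loss by the median are easy to check numerically. The sketch below is my own illustration (the sample, grid, and sizes are arbitrary), not part of any of the posts listed here.

import numpy as np

rng = np.random.default_rng(1)
y = rng.lognormal(size=2_000)                  # a skewed sample, so mean and median differ

c = np.linspace(0.0, 5.0, 5_001)               # grid of candidate constant predictions
resid = c[:, None] - y[None, :]                # candidate-by-sample residuals
best_sq = c[np.argmin((resid ** 2).mean(axis=1))]
best_abs = c[np.argmin(np.abs(resid).mean(axis=1))]

# Up to the grid resolution, the minimizers match the sample mean and median.
print(f"minimizer of squared loss  {best_sq:.2f}   sample mean   {y.mean():.2f}")
print(f"minimizer of absolute loss {best_abs:.2f}   sample median {np.median(y):.2f}")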
topicId topicWeight
[(0, 0.206), (1, 0.25), (2, 0.161), (3, -0.232), (4, -0.341), (5, 0.14), (6, -0.08), (7, -0.014), (8, -0.039), (9, 0.011), (10, 0.029), (11, -0.057), (12, -0.034), (13, 0.016), (14, -0.073), (15, -0.017), (16, -0.052), (17, 0.048), (18, -0.002), (19, 0.061), (20, 0.064), (21, 0.068), (22, 0.045), (23, 0.086), (24, 0.033), (25, -0.052), (26, 0.025), (27, 0.013), (28, -0.05), (29, -0.047), (30, 0.006), (31, 0.002), (32, -0.017), (33, -0.058), (34, 0.07), (35, 0.119), (36, 0.084), (37, -0.035), (38, 0.032), (39, -0.013), (40, 0.032), (41, -0.058), (42, 0.002), (43, 0.001), (44, -0.066), (45, 0.024), (46, 0.025), (47, -0.019), (48, 0.039), (49, -0.064)]
simIndex simValue blogId blogTitle
same-blog 1 0.9908579 274 hunch net-2007-11-28-Computational Consequences of Classification
2 0.87267607 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
3 0.83942378 245 hunch net-2007-05-12-Loss Function Semantics
4 0.83440703 9 hunch net-2005-02-01-Watchword: Loss
5 0.81095952 259 hunch net-2007-08-19-Choice of Metrics
Introduction: How do we judge success in Machine Learning? As Aaron notes, the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real wor
6 0.80188662 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
7 0.78940219 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?
8 0.74601233 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
9 0.64646846 374 hunch net-2009-10-10-ALT 2009
10 0.55662501 67 hunch net-2005-05-06-Don’t mix the solution into the problem
11 0.55642456 129 hunch net-2005-11-07-Prediction Competitions
12 0.49834755 218 hunch net-2006-11-20-Context and the calculation misperception
13 0.47928095 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class
14 0.45416647 371 hunch net-2009-09-21-Netflix finishes (and starts)
15 0.43866947 299 hunch net-2008-04-27-Watchword: Supervised Learning
16 0.43692479 62 hunch net-2005-04-26-To calibrate or not?
17 0.42386192 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
18 0.4226498 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
19 0.41507748 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
20 0.41488045 103 hunch net-2005-08-18-SVM Adaptability
topicId topicWeight
[(27, 0.853), (55, 0.023)]
simIndex simValue blogId blogTitle
same-blog 1 0.99970484 274 hunch net-2007-11-28-Computational Consequences of Classification
2 0.99964017 166 hunch net-2006-03-24-NLPers
Introduction: Hal Daume has started the NLPers blog to discuss learning for language problems.
3 0.99964017 246 hunch net-2007-06-13-Not Posting
Introduction: If you have been disappointed by the lack of a post for the last month, consider contributing your own (I’ve been busy+uninspired). Also, keep in mind that there is a community of machine learning blogs (see the sidebar).
4 0.99964017 418 hunch net-2010-12-02-Traffic Prediction Problem
Introduction: Slashdot points out the Traffic Prediction Challenge which looks pretty fun. The temporal aspect seems to be very common in many real-world problems and somewhat understudied.
5 0.99875641 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
Introduction: Here are two papers that seem particularly interesting at this year’s COLT. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily ch
6 0.99603409 308 hunch net-2008-07-06-To Dual or Not
7 0.99459255 172 hunch net-2006-04-14-JMLR is a success
8 0.9945116 288 hunch net-2008-02-10-Complexity Illness
9 0.99404675 245 hunch net-2007-05-12-Loss Function Semantics
10 0.9932965 400 hunch net-2010-06-13-The Good News on Exploration and Learning
11 0.98765099 45 hunch net-2005-03-22-Active learning
12 0.97672135 9 hunch net-2005-02-01-Watchword: Loss
13 0.9720856 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
14 0.97104472 352 hunch net-2009-05-06-Machine Learning to AI
15 0.96476227 304 hunch net-2008-06-27-Reviewing Horror Stories
16 0.95706761 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
17 0.94606525 483 hunch net-2013-06-10-The Large Scale Learning class notes
18 0.94287884 244 hunch net-2007-05-09-The Missing Bound
19 0.93880671 293 hunch net-2008-03-23-Interactive Machine Learning
20 0.93168205 67 hunch net-2005-05-06-Don’t mix the solution into the problem