hunch_net hunch_net-2005 hunch_net-2005-79 knowledge-graph by maker-knowledge-mining

79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


meta info for this blog

Source: html

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s
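The Bayesian recipe described above — predict the conditional probability of each possibility, then choose the action minimizing expected loss — can be sketched in a few lines. This is a minimal illustration, not from the post itself; the loss matrix is a hypothetical example where a false negative costs five times a false positive.

```python
import numpy as np

# Hypothetical loss matrix: rows = true label, columns = chosen action.
# Entry [y, a] is the loss for choosing action a when the truth is y.
loss = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def bayes_decision(p_y, loss):
    """Choose the action minimizing expected loss under p(y|x)."""
    expected = p_y @ loss  # expected loss of each action
    return int(np.argmin(expected))

# With p(y=1|x) = 0.3 a symmetric loss would predict 0, but the
# asymmetric loss (expected losses [1.5, 0.7]) pushes the decision to 1.
p = np.array([0.7, 0.3])
action = bayes_decision(p, loss)
```

Note that the loss enters only at decision (test) time here; the model producing `p` could have been trained without ever seeing the loss, which is exactly the separation the post questions.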


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Hal asks a very good question: “When is the right time to insert the loss function? [sent-1, score-0.838]

2 ” In particular, should it be used at testing time or at training time? [sent-2, score-0.345]

3 When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. [sent-3, score-1.321]

4 In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. [sent-4, score-1.381]

5 The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. [sent-5, score-1.041]

6 My experience is that optimizing the loss in the most direct manner possible typically yields best performance. [sent-6, score-1.0]

7 This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that solves the problem”. [sent-7, score-0.503]

8 (One difficulty with this principle is that ‘simplest’ is difficult to define in a satisfying way. [sent-8, score-0.336]

9 ) One reason why it’s unclear is that optimizing an arbitrary loss is not an easy thing for a learning algorithm to cope with. [sent-9, score-1.102]

10 Learning reductions (which I am a big fan of) give a mechanism for doing this, but they are new and relatively untried. [sent-10, score-0.231]

11 Drew Bagnell adds: Another approach to integrating loss functions into learning is to try to re-derive ideas about probability theory appropriate for other loss functions. [sent-11, score-1.36]

12 Dawid presents a variant on maximum entropy learning. [sent-14, score-0.251]

13 Unfortunately, it’s even less clear how often these approaches lead to efficient algorithms. [sent-15, score-0.074]
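The learning reductions mentioned in sentence 10 give one mechanism for optimizing an arbitrary loss at training time. A standard example (my sketch, not from the post; the function name is hypothetical) is the reduction from cost-sensitive binary classification to importance-weighted classification: each example is relabeled with its cheaper action and weighted by the cost difference.

```python
def cost_sensitive_to_weighted(examples):
    """Reduce cost-sensitive binary classification to importance-weighted
    classification.

    examples: list of (x, cost_if_predict_0, cost_if_predict_1).
    Returns a list of (x, label, weight) where label is the cheaper action
    and weight is how much getting this example wrong actually costs.
    """
    weighted = []
    for x, c0, c1 in examples:
        label = 0 if c0 < c1 else 1
        weight = abs(c0 - c1)
        weighted.append((x, label, weight))
    return weighted

# Example: predicting 1 on "a" saves 2.0 in cost, so it becomes
# a weight-2.0 positive for the underlying weighted classifier.
weighted = cost_sensitive_to_weighted([("a", 3.0, 1.0), ("b", 0.0, 2.0)])
```

Any importance-weighted classifier trained on the output then optimizes the original cost-sensitive loss directly, rather than going through conditional probabilities.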


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('loss', 0.543), ('optimizing', 0.223), ('principle', 0.175), ('training', 0.168), ('possibility', 0.163), ('simplest', 0.151), ('optimize', 0.146), ('unclear', 0.142), ('fan', 0.124), ('minimizes', 0.115), ('vapnik', 0.115), ('recipe', 0.115), ('imposes', 0.115), ('probability', 0.107), ('give', 0.107), ('vladimir', 0.103), ('proxy', 0.103), ('indicate', 0.103), ('bayesian', 0.103), ('thing', 0.101), ('time', 0.1), ('asks', 0.099), ('bagnell', 0.099), ('adds', 0.099), ('solves', 0.099), ('insert', 0.096), ('cope', 0.093), ('integrating', 0.093), ('drew', 0.09), ('entropy', 0.088), ('peter', 0.088), ('satisfying', 0.088), ('confusion', 0.085), ('lecun', 0.085), ('ignore', 0.085), ('variant', 0.084), ('instance', 0.082), ('hal', 0.08), ('yann', 0.079), ('maximum', 0.079), ('direct', 0.079), ('yields', 0.079), ('question', 0.078), ('lost', 0.077), ('testing', 0.077), ('manner', 0.076), ('money', 0.076), ('lead', 0.074), ('approach', 0.074), ('define', 0.073)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


2 0.45249856 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l 2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

3 0.41569108 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f w (x)= sum i w i x i where optimizing squared loss (y-f w (x)) 2 over many samples is much more tractable than optimizing 0-1 loss I(y = Threshold(f w (x) – 0.5)) . While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1] , we can treat it as a predicted probability that the label is 1 , and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi

4 0.36889857 245 hunch net-2007-05-12-Loss Function Semantics

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l sq (y,y’)=(y-y’) 2 means predicting the (conditional) mean of y . Optimizing absolute value loss l av (y,y’)=|y-y’| means predicting the (conditional) median of y . Variants can handle other quantiles . 0/1 loss for classification is a special case. Optimizing log loss l log (y,y’)=log (1/Pr z~y’ (z=y)) means minimizing the description length of y . The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y , if y’ = arg min y’ E y ~ D l sq (y,y’) then y’ = E y~D y Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y . There are 3 points to this post. Everyone doing general machine lear

5 0.25213522 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate , I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

6 0.23878857 259 hunch net-2007-08-19-Choice of Metrics

7 0.19773073 374 hunch net-2009-10-10-ALT 2009

8 0.19392745 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.18082498 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

10 0.17174271 129 hunch net-2005-11-07-Prediction Competitions

11 0.16365212 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

12 0.1606624 103 hunch net-2005-08-18-SVM Adaptability

13 0.15502939 235 hunch net-2007-03-03-All Models of Learning have Flaws

14 0.13535511 165 hunch net-2006-03-23-The Approximation Argument

15 0.1350348 420 hunch net-2010-12-26-NIPS 2010

16 0.13168116 177 hunch net-2006-05-05-An ICML reject

17 0.13023402 67 hunch net-2005-05-06-Don’t mix the solution into the problem

18 0.12865824 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

19 0.12852216 371 hunch net-2009-09-21-Netflix finishes (and starts)

20 0.12427854 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models
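The loss-semantics claims quoted in the entries above (minimizing squared loss predicts the conditional mean; minimizing absolute loss predicts the conditional median) can be checked numerically. A minimal sketch using a grid search over constant predictions, assuming a small skewed sample, not the theorem itself:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.0, 10.0])  # skewed sample: mean 3.75, median 2.0

# Search over candidate constant predictions c on a fine grid.
grid = np.linspace(0.0, 12.0, 1201)
sq_best = grid[np.argmin([np.square(y - c).mean() for c in grid])]
abs_best = grid[np.argmin([np.abs(y - c).mean() for c in grid])]

# Squared loss is minimized near the mean (3.75); absolute loss near
# the median (2.0). The outlier 10.0 pulls the mean but not the median,
# which is why the choice of training loss changes the prediction.
```

This is the concrete sense in which the loss chosen at training time determines what the learned predictor means.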


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.229), (1, 0.213), (2, 0.129), (3, -0.167), (4, -0.376), (5, 0.173), (6, -0.184), (7, 0.05), (8, 0.116), (9, 0.019), (10, 0.083), (11, -0.102), (12, -0.053), (13, 0.063), (14, 0.001), (15, 0.013), (16, 0.023), (17, -0.011), (18, -0.019), (19, 0.009), (20, 0.025), (21, 0.001), (22, -0.064), (23, -0.017), (24, 0.029), (25, 0.037), (26, -0.015), (27, 0.013), (28, 0.024), (29, 0.054), (30, 0.014), (31, 0.015), (32, 0.004), (33, -0.01), (34, -0.055), (35, -0.008), (36, 0.008), (37, -0.004), (38, -0.039), (39, 0.003), (40, 0.019), (41, -0.001), (42, -0.01), (43, 0.028), (44, 0.07), (45, -0.051), (46, 0.016), (47, -0.026), (48, -0.036), (49, 0.028)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99013561 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


2 0.9620173 9 hunch net-2005-02-01-Watchword: Loss


3 0.93550909 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


4 0.93016523 245 hunch net-2007-05-12-Loss Function Semantics


5 0.85998559 259 hunch net-2007-08-19-Choice of Metrics

Introduction: How do we judge success in Machine Learning? As Aaron notes , the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real wor

6 0.80312675 374 hunch net-2009-10-10-ALT 2009

7 0.78992271 274 hunch net-2007-11-28-Computational Consequences of Classification

8 0.78907299 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.74380994 129 hunch net-2005-11-07-Prediction Competitions

10 0.59568322 103 hunch net-2005-08-18-SVM Adaptability

11 0.57097489 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

12 0.5219177 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

13 0.50667679 67 hunch net-2005-05-06-Don’t mix the solution into the problem

14 0.50388074 371 hunch net-2009-09-21-Netflix finishes (and starts)

15 0.47704318 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

16 0.47034109 299 hunch net-2008-04-27-Watchword: Supervised Learning

17 0.45353493 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

18 0.44384736 165 hunch net-2006-03-23-The Approximation Argument

19 0.4436996 420 hunch net-2010-12-26-NIPS 2010

20 0.4225218 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.019), (3, 0.045), (10, 0.044), (22, 0.163), (24, 0.012), (27, 0.271), (38, 0.071), (50, 0.01), (53, 0.076), (55, 0.046), (92, 0.019), (94, 0.084), (95, 0.046)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.95035154 114 hunch net-2005-09-20-Workshop Proposal: Atomic Learning

Introduction: This is a proposal for a workshop. It may or may not happen depending on the level of interest. If you are interested, feel free to indicate so (by email or comments). Description: Assume(*) that any system for solving large difficult learning problems must decompose into repeated use of basic elements (i.e. atoms). There are many basic questions which remain: What are the viable basic elements? What makes a basic element viable? What are the viable principles for the composition of these basic elements? What are the viable principles for learning in such systems? What problems can this approach handle? Hal Daume adds: Can composition of atoms be (semi-) automatically constructed[?] When atoms are constructed through reductions, is there some notion of the “naturalness” of the created leaning problems? Other than Markov fields/graphical models/Bayes nets, is there a good language for representing atoms and their compositions? The answer to these a

2 0.9358843 358 hunch net-2009-06-01-Multitask Poisoning

Introduction: There are many ways that interesting research gets done. For example it’s common at a conference for someone to discuss a problem with a partial solution, and for someone else to know how to solve a piece of it, resulting in a paper. In some sense, these are the easiest results we can achieve, so we should ask: Can all research be this easy? The answer is certainly no for fields where research inherently requires experimentation to discover how the real world works. However, mathematics, including parts of physics, computer science, statistics, etc… which are effectively mathematics don’t require experimentation. In effect, a paper can be simply a pure expression of thinking. Can all mathematical-style research be this easy? What’s going on here is research-by-communication. Someone knows something, someone knows something else, and as soon as someone knows both things, a problem is solved. The interesting thing about research-by-communication is that it is becoming radic

same-blog 3 0.93397075 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


4 0.87690628 113 hunch net-2005-09-19-NIPS Workshops

Introduction: Attendance at the NIPS workshops is highly recommended for both research and learning. Unfortunately, there does not yet appear to be a public list of workshops. However, I found the following workshop webpages of interest: Machine Learning in Finance Learning to Rank Foundations of Active Learning Machine Learning Based Robotics in Unstructured Environments There are many more workshops. In fact, there are so many that it is not plausible anyone can attend every workshop they are interested in. Maybe in future years the organizers can spread them out over more days to reduce overlap. Many of these workshops are accepting presentation proposals (due mid-October).

5 0.86525524 12 hunch net-2005-02-03-Learning Theory, by assumption

Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense ), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases. No assumptions Online learning There exist a meta prediction algorithm which compete well with the best element of any set of prediction algorithms. Universal Learning Using a “bias” of 2 - description length of turing machine in learning is equivalent to all other computable biases up to some constant. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems. Independent and Identically Distributed (IID) Data Performance Prediction Based upon past performance, you can predict future performance. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.

6 0.86395156 41 hunch net-2005-03-15-The State of Tight Bounds

7 0.86197472 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

8 0.86108637 14 hunch net-2005-02-07-The State of the Reduction

9 0.85961366 95 hunch net-2005-07-14-What Learning Theory might do

10 0.8594594 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.85921812 332 hunch net-2008-12-23-Use of Learning Theory

12 0.85828424 351 hunch net-2009-05-02-Wielding a New Abstraction

13 0.85772485 360 hunch net-2009-06-15-In Active Learning, the question changes

14 0.85765034 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

15 0.8572666 347 hunch net-2009-03-26-Machine Learning is too easy

16 0.85593289 258 hunch net-2007-08-12-Exponentiated Gradient

17 0.85458016 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

18 0.85299867 183 hunch net-2006-06-14-Explorations of Exploration

19 0.8527686 131 hunch net-2005-11-16-The Everything Ensemble Edge

20 0.85233718 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models