hunch_net hunch_net-2009 hunch_net-2009-374 knowledge-graph by maker-knowledge-mining

374 hunch net-2009-10-10-ALT 2009


meta info for this blog

Source: html

Introduction: I attended ALT (“Algorithmic Learning Theory”) for the first time this year. My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. There are many differences which can’t quite be described this way though. The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength. One paper that might interest people generally is: Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. The basic observation here is that in the online learning with experts setting you can compete with several compatible loss functions simultaneously. Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. This might interest anyone who has run into “which loss function?” debates that come up periodically.
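The Chernov–Vovk paper concerns prediction with expert advice. As a loosely related sketch (not their algorithm), the classical Bayes-mixture forecaster illustrates what "competing with experts" means under log loss: its cumulative log loss exceeds the best expert's by at most ln(N). The experts and data below are hypothetical toy choices.

```python
import math
import random

def bayes_mixture(expert_probs, outcomes):
    """Aggregate expert probability forecasts for a binary outcome.

    expert_probs[t][i] is expert i's predicted P(y_t = 1). The mixture
    predicts the posterior-weighted average of expert probabilities,
    which guarantees cumulative log loss <= best expert + ln(N).
    """
    n = len(expert_probs[0])
    log_w = [0.0] * n  # log-weight of expert i = its cumulative log-likelihood
    total_loss = 0.0
    for probs, y in zip(expert_probs, outcomes):
        m = max(log_w)  # normalize for numerical stability
        w = [math.exp(lw - m) for lw in log_w]
        p = sum(wi * pi for wi, pi in zip(w, probs)) / sum(w)
        total_loss += -math.log(p if y == 1 else 1.0 - p)
        # update each expert's weight by the likelihood of the observed outcome
        for i, pi in enumerate(probs):
            log_w[i] += math.log(pi if y == 1 else 1.0 - pi)
    return total_loss, log_w

random.seed(0)
T, N = 200, 5
outcomes = [int(random.random() < 0.7) for _ in range(T)]
# hypothetical experts: constant forecasters at different probability levels
levels = [0.1, 0.3, 0.5, 0.7, 0.9]
expert_probs = [list(levels) for _ in range(T)]

mix_loss, log_w = bayes_mixture(expert_probs, outcomes)
best_loss = -max(log_w)  # -log_w[i] is expert i's cumulative log loss
assert mix_loss <= best_loss + math.log(N) + 1e-9
```

The paper's point goes further: under suitable compatibility conditions, one forecaster can carry regret guarantees for several such loss functions at once, so choosing between log loss and squared loss up front is unnecessary.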


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 I attended ALT (“Algorithmic Learning Theory”) for the first time this year. [sent-1, score-0.16]

2 My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. [sent-3, score-0.174]

3 There are many differences which can’t quite be described this way though. [sent-5, score-0.365]

4 The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength. [sent-6, score-0.431]

5 One paper that might interest people generally is: Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. [sent-7, score-0.331]

6 The basic observation here is that in the online learning with experts setting you can compete with several compatible loss functions simultaneously. [sent-8, score-1.271]

7 Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. [sent-9, score-1.676]

8 This might interest anyone who has run into “which loss function?” debates that come up periodically. [sent-10, score-0.665]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('alt', 0.512), ('loss', 0.294), ('compete', 0.254), ('breath', 0.184), ('periodically', 0.184), ('debating', 0.184), ('weakness', 0.171), ('waste', 0.171), ('vladimir', 0.154), ('vovk', 0.154), ('colt', 0.152), ('compatible', 0.147), ('interest', 0.138), ('differences', 0.134), ('described', 0.134), ('competing', 0.134), ('restated', 0.127), ('expert', 0.124), ('simultaneously', 0.122), ('impression', 0.119), ('attendance', 0.119), ('algorithmic', 0.115), ('attended', 0.115), ('advice', 0.113), ('diverse', 0.113), ('squared', 0.109), ('experts', 0.106), ('functions', 0.09), ('free', 0.085), ('observation', 0.084), ('log', 0.084), ('might', 0.08), ('run', 0.077), ('anyone', 0.076), ('substantially', 0.076), ('program', 0.071), ('function', 0.07), ('generally', 0.069), ('setting', 0.068), ('almost', 0.067), ('come', 0.066), ('online', 0.058), ('prediction', 0.057), ('theory', 0.055), ('get', 0.055), ('quite', 0.051), ('basic', 0.048), ('way', 0.046), ('first', 0.045), ('paper', 0.044)]
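The word weights above presumably come from a tf-idf weighting: term frequency in this post times the log-inverse document frequency across the blog. A minimal sketch of the standard scheme over a toy three-document corpus (not this site's actual pipeline, which likely also normalizes the vectors):

```python
import math
from collections import Counter

def tfidf_top_words(docs, top_n=3):
    """Rank words of the first document by tf * idf over a small corpus."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of documents containing each word
    df = Counter(w for toks in tokenized for w in set(toks))
    tf = Counter(tokenized[0])
    scores = {
        w: (count / len(tokenized[0])) * math.log(n_docs / df[w])
        for w, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

docs = [
    "the alt colt loss loss compete alt",   # stand-in for this post
    "the workshop the program the talks",
    "the reviews the papers the process",
]
top = tfidf_top_words(docs)
# distinctive words rank high; "the" appears everywhere, so idf = 0
assert "alt" in top and "loss" in top and "the" not in top
```

This is why 'alt' and 'loss' dominate the list above: they are frequent here and rare elsewhere on the blog.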

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 374 hunch net-2009-10-10-ALT 2009


2 0.25236672 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l_2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

3 0.23644122 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) - 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p’(y|x)) where p’(y|x) is the predicted probability of the observed label. A standard method for confi

4 0.20448059 245 hunch net-2007-05-12-Loss Function Semantics

Introduction: Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself. Optimizing squared loss l_sq(y,y’) = (y-y’)^2 means predicting the (conditional) mean of y. Optimizing absolute value loss l_av(y,y’) = |y-y’| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case. Optimizing log loss l_log(y,y’) = log(1/Pr_{z~y’}(z=y)) means minimizing the description length of y. The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form: For all distributions D over Y, if y’ = arg min_{y’} E_{y~D} l_sq(y,y’) then y’ = E_{y~D} y. Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y. There are 3 points to this post. Everyone doing general machine lear

5 0.19773073 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

Introduction: Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time? When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set. The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that s

6 0.17838493 274 hunch net-2007-11-28-Computational Consequences of Classification

7 0.16606721 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

8 0.15340567 259 hunch net-2007-08-19-Choice of Metrics

9 0.1128419 89 hunch net-2005-07-04-The Health of COLT

10 0.1111318 129 hunch net-2005-11-07-Prediction Competitions

11 0.10867538 388 hunch net-2010-01-24-Specializations of the Master Problem

12 0.10817245 371 hunch net-2009-09-21-Netflix finishes (and starts)

13 0.1059173 453 hunch net-2012-01-28-Why COLT?

14 0.096015528 324 hunch net-2008-11-09-A Healthy COLT

15 0.093759723 369 hunch net-2009-08-27-New York Area Machine Learning Events

16 0.091276579 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

17 0.087232515 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

18 0.086109951 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

19 0.085313641 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

20 0.084548347 103 hunch net-2005-08-18-SVM Adaptability
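Entry 4 in the list above (Loss Function Semantics) states that minimizing squared loss predicts the mean and minimizing absolute loss predicts the median. A minimal numeric check of both claims over a grid of candidate predictions:

```python
# Empirically confirm: the minimizer of total squared loss is the mean,
# and the minimizer of total absolute loss is the median.
ys = [1.0, 2.0, 2.0, 3.0, 10.0]

def argmin_pred(loss):
    """Grid-search the prediction minimizing total loss over the sample."""
    grid = [i / 100.0 for i in range(0, 1101)]  # candidates in [0, 11]
    return min(grid, key=lambda p: sum(loss(y, p) for y in ys))

sq_min = argmin_pred(lambda y, p: (y - p) ** 2)
av_min = argmin_pred(lambda y, p: abs(y - p))

mean = sum(ys) / len(ys)              # 3.6
median = sorted(ys)[len(ys) // 2]     # 2.0

assert abs(sq_min - mean) < 0.01
assert abs(av_min - median) < 0.01
```

Note how the outlier 10.0 pulls the squared-loss minimizer (the mean) far from the absolute-loss minimizer (the median), which is exactly the semantic difference the post describes.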


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.162), (1, 0.064), (2, 0.077), (3, -0.164), (4, -0.216), (5, 0.129), (6, -0.152), (7, -0.018), (8, 0.017), (9, 0.063), (10, 0.155), (11, -0.01), (12, -0.034), (13, 0.062), (14, 0.062), (15, 0.007), (16, 0.122), (17, -0.034), (18, -0.016), (19, 0.047), (20, 0.053), (21, 0.043), (22, 0.015), (23, 0.047), (24, -0.001), (25, 0.031), (26, 0.009), (27, 0.044), (28, 0.044), (29, 0.006), (30, -0.019), (31, -0.011), (32, 0.007), (33, 0.058), (34, 0.004), (35, -0.013), (36, 0.044), (37, 0.03), (38, 0.006), (39, -0.074), (40, -0.035), (41, 0.065), (42, -0.023), (43, -0.013), (44, -0.047), (45, -0.089), (46, 0.044), (47, -0.092), (48, -0.012), (49, 0.022)]
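The simValue column in the lists on this page plausibly reflects cosine similarity between topic-weight vectors like the one above (an assumption about this pipeline, not something the page states). The standard computation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# first few lsi weights of this post, compared against itself
a = [0.162, 0.064, 0.077, -0.164, -0.216]
assert abs(cosine(a, a) - 1.0) < 1e-12          # identical vectors
assert abs(cosine([1.0, 0.0], [0.0, 1.0])) < 1e-12  # orthogonal vectors
```

This is consistent with the same-blog entry scoring near 1.0 under each model below.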

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98313642 374 hunch net-2009-10-10-ALT 2009


2 0.78256971 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


3 0.78147089 245 hunch net-2007-05-12-Loss Function Semantics


4 0.77808797 9 hunch net-2005-02-01-Watchword: Loss


5 0.77128899 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”


6 0.68184906 259 hunch net-2007-08-19-Choice of Metrics

7 0.65597433 274 hunch net-2007-11-28-Computational Consequences of Classification

8 0.58000916 129 hunch net-2005-11-07-Prediction Competitions

9 0.5646323 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

10 0.50845367 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

11 0.48573151 324 hunch net-2008-11-09-A Healthy COLT

12 0.4552497 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

13 0.43294153 371 hunch net-2009-09-21-Netflix finishes (and starts)

14 0.42382172 89 hunch net-2005-07-04-The Health of COLT

15 0.4189176 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

16 0.41030249 394 hunch net-2010-04-24-COLT Treasurer is now Phil Long

17 0.4032664 308 hunch net-2008-07-06-To Dual or Not

18 0.40160882 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

19 0.39608616 258 hunch net-2007-08-12-Exponentiated Gradient

20 0.39553446 453 hunch net-2012-01-28-Why COLT?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.27), (55, 0.113), (66, 0.413), (94, 0.07)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.87971866 374 hunch net-2009-10-10-ALT 2009


2 0.82918537 385 hunch net-2009-12-27-Interesting things at NIPS 2009

Introduction: Several papers at NIPS caught my attention. Elad Hazan and Satyen Kale, Online Submodular Optimization. They define an algorithm for online optimization of submodular functions with regret guarantees. This places submodular optimization roughly on par with online convex optimization as tractable settings for online learning. Elad Hazan and Satyen Kale, On Stochastic and Worst-Case Models of Investing. At its core, this is yet another example of modifying worst-case online learning to deal with variance, but the application to financial models is particularly cool and it seems plausibly superior to other common approaches for financial modeling. Mark Palatucci, Dean Pomerlau, Tom Mitchell, and Geoff Hinton, Zero Shot Learning with Semantic Output Codes. The goal here is predicting a label in a multiclass supervised setting where the label never occurs in the training data. They have some basic analysis and also a nice application to FMRI brain reading. Sh

3 0.70735401 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

Introduction: A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it): In short I was thinking that “bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict! Example (from Kass 1996): consider a die (6 sides), and prior knowledge E[X]=3.5. Maximum entropy leads to P(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Now consider a new piece of evidence A = “X is an odd number”. The Bayesian posterior P(X|A) ∝ P(A|X) P(X) = (1/3, 0, 1/3, 0, 1/3, 0). But MaxEnt with the constraints E[X]=3.5 and E[indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0)!! (Note that E[indicator function of A] = P(A).) Indeed, for MaxEnt, because there is no more ‘6’, big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn’t have to have a 3.5

4 0.57454956 304 hunch net-2008-06-27-Reviewing Horror Stories

Introduction: Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections do not come to seem acceptable, and these form the basis of reviewing horror stories, great material for conversations. I’ve decided to share three of mine, now all safely a bit distant in the past. Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject which appeared quite unjust to me—for example one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. App

5 0.5705235 325 hunch net-2008-11-10-ICML Reviewing Criteria

Introduction: Michael Littman and Leon Bottou have decided to use a franchise program chair approach to reviewing at ICML this year. I’ll be one of the area chairs, so I wanted to mention a few things if you are thinking about naming me. I take reviewing seriously. That means papers to be reviewed are read, the implications are considered, and decisions are only made after that. I do my best to be fair, and there are zero subjects that I consider categorical rejects. I don’t consider several arguments for rejection-not-on-the-merits reasonable. I am generally interested in papers that (a) analyze new models of machine learning, (b) provide new algorithms, and (c) show that they work empirically on plausibly real problems. If a paper has the trifecta, I’m particularly interested. With 2 out of 3, I might be interested. I often find papers with only one element harder to accept, including papers with just (a). I’m a bit tough. I rarely jump-up-and-down about a paper, because I b

6 0.5667389 293 hunch net-2008-03-23-Interactive Machine Learning

7 0.56359649 51 hunch net-2005-04-01-The Producer-Consumer Model of Research

8 0.56354171 378 hunch net-2009-11-15-The Other Online Learning

9 0.5631007 220 hunch net-2006-11-27-Continuizing Solutions

10 0.5628202 352 hunch net-2009-05-06-Machine Learning to AI

11 0.56174225 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

12 0.55975986 343 hunch net-2009-02-18-Decision by Vetocracy

13 0.55880141 252 hunch net-2007-07-01-Watchword: Online Learning

14 0.55773151 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

15 0.55772555 379 hunch net-2009-11-23-ICML 2009 Workshops (and Tutorials)

16 0.55735236 218 hunch net-2006-11-20-Context and the calculation misperception

17 0.55464315 44 hunch net-2005-03-21-Research Styles in Machine Learning

18 0.55449986 225 hunch net-2007-01-02-Retrospective

19 0.55435747 164 hunch net-2006-03-17-Multitask learning is Black-Boxable

20 0.55345732 360 hunch net-2009-06-15-In Active Learning, the question changes
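Entry 3 in the lda list above (MaxEnt contradicts Bayes Rule?) reports that maximum entropy on an odd-valued die with mean constraint 3.5 yields roughly (.22, 0, .32, 0, .47, 0). That number can be reproduced: the MaxEnt solution has exponential form p(x) ∝ exp(lam·x) on the support {1, 3, 5}, and lam can be found by bisection on the mean. A minimal check:

```python
import math

def maxent_odd_die(target_mean=3.5, tol=1e-12):
    """Max-entropy distribution on {1, 3, 5} with E[X] = target_mean.

    The solution has the form p(x) proportional to exp(lam * x);
    solve for lam by bisection, since the mean is increasing in lam.
    """
    xs = [1, 3, 5]
    def mean(lam):
        ws = [math.exp(lam * x) for x in xs]
        return sum(x * w for x, w in zip(xs, ws)) / sum(ws)
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    ws = [math.exp(lam * x) for x in xs]
    z = sum(ws)
    return [w / z for w in ws]

p = maxent_odd_die()
# matches the post's (.22, .32, .47) up to rounding
assert abs(p[0] - 0.216) < 0.01
assert abs(p[1] - 0.317) < 0.01
assert abs(p[2] - 0.466) < 0.01
```

The Bayesian posterior for the same evidence is simply the uniform prior restricted to odd faces, (1/3, 1/3, 1/3), so the two answers genuinely differ, as that entry claims.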