hunch_net-2006-196: knowledge graph by maker-knowledge-mining

196 hunch net-2006-07-13-Regression vs. Classification as a Primitive


meta info for this blog

Source: html

Introduction: For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered. Binary classification is learning a classifier c: X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y~D}(c(x) != y). The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s: X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2. It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction a squared error regret of r implies a classification regret of at most r^{0.5}.
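The reduction just described is easy to make concrete. Below is a minimal sketch (not code from the post) of classification via squared error regression: fit a regressor to the {0,1} labels so that s(x) approximates D(y=1|x), then threshold at 0.5. The synthetic data and the choice of Ridge regression are illustrative assumptions; any squared-error learner would do.

```python
# Minimal sketch of the classification-via-regression reduction described above.
# Assumes numpy and scikit-learn are installed; the dataset and Ridge are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Squared error regression against the 0/1 labels estimates D(y=1|x).
regressor = Ridge(alpha=1.0).fit(X, y.astype(float))
scores = np.clip(regressor.predict(X), 0.0, 1.0)   # s: X -> [0,1]

# The induced classifier predicts 1 exactly when the regressed probability exceeds 0.5.
predictions = (scores > 0.5).astype(int)            # c: X -> {0,1}

squared_error = np.mean((scores - y) ** 2)           # estimate of E_{x,y~D}(s(x)-y)^2
zero_one_loss = np.mean(predictions != y)            # estimate of Pr_{x,y~D}(c(x) != y)
print(f"squared error: {squared_error:.3f}, 0/1 loss: {zero_one_loss:.3f}")
```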


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. [sent-1, score-0.23]

2 The primary alternative candidate seems to be squared error regression. [sent-4, score-0.814]

3 In squared error regression, you learn a regressor s: X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2. [sent-5, score-1.512]

4 The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. [sent-7, score-0.205]

5 Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. [sent-9, score-0.826]

6 For this simple reduction a squared error regret of r implies a classification regret of at most r^{0.5}. [sent-11, score-1.421]

7 Regression can also be reduced to classification using the Probing algorithm (a simplified sketch appears after this summary). [sent-13, score-0.292]

8 Under this reduction, a classification regret of r implies a squared error regression regret of at most r. [sent-15, score-1.817]

9 Both primitives enjoy a significant amount of prior work with (perhaps) classification enjoying more work in the machine learning community and regression having more emphasis in the statistics community. [sent-16, score-0.966]

10 The (nonhistoric) reasons for preferring classification seem to be: Aesthetically, what could be a more natural primitive than predicting a single bit? [sent-17, score-0.695]

11 When translated into transistors, a classification algorithm might use fewer transistors than a regressor, simply because the natural representation is bits rather than reals (~= floats). [sent-19, score-0.547]

12 There are several reasons to consider regression over classification: More uniform convergence. [sent-20, score-0.51]

13 For a single squared error regressor, the rate of convergence of the estimate of squared error to the expected squared error goes as 1/m, assuming IID samples. [sent-21, score-2.722]

14 For a single binary classifier, the rate of convergence of the estimate of binary loss to the expected binary loss goes as a function varying between 1/m and 1/m^{0.5}. [sent-22, score-1.186]

15 In particular, the Probing algorithm must call a classifier several times in several ways to make a high precision regression prediction. [sent-25, score-0.675]

16 On the other hand, classification via regression requires one call to the underlying regressor. [sent-26, score-0.877]

17 Squared error regression learning is often easier than 0/1 classification learning. [sent-27, score-1.051]

18 This is because the squared error loss is convex, but 0/1 loss is not. [sent-28, score-1.384]

19 Note: this does not imply that squared error regression is convex (It isn’t for general regression algorithms). [sent-29, score-1.76]

20 The mathematical evidence points toward squared error regression as a better primitive, although doubts still seem reasonable to entertain. [sent-31, score-1.337]
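Sentences 7, 8, 15, and 16 concern the reverse direction, reducing squared error regression to binary classification. The sketch below is a simplified, Probing-style illustration of that idea rather than the exact Probing algorithm: for each threshold t on a grid, a classifier is trained on an importance-weighted copy of the data whose optimal rule is 1{D(y=1|x) > t}, and the regressor averages the classifiers' predictions over the grid. The grid size, LogisticRegression, and the synthetic data are illustrative assumptions.

```python
# Simplified Probing-style sketch: regression via many binary classifiers.
# Assumes numpy and scikit-learn; LogisticRegression stands in for any binary learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

thresholds = np.linspace(0.05, 0.95, 19)
classifiers = []
for t in thresholds:
    # Weighting negatives by t and positives by (1 - t) makes the weighted-optimal
    # classifier predict 1 exactly when D(y=1|x) > t.
    weights = np.where(y == 1, 1.0 - t, t)
    classifiers.append(LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights))

def regress(X_new):
    # Each prediction calls every classifier in the grid -- the "several calls in
    # several ways" cost noted in sentence 15.
    votes = np.stack([clf.predict(X_new) for clf in classifiers], axis=0)
    return votes.mean(axis=0)   # s: X -> [0,1]

s = regress(X)
print(f"squared error of the induced regressor: {np.mean((s - y) ** 2):.3f}")
```

Averaging thresholded classifiers approximates D(y=1|x) because the integral of 1{D(y=1|x) > t} over t in [0,1] equals D(y=1|x); a finer grid trades more classifier calls for higher precision, which is the computational trade-off the summary points out.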


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('squared', 0.461), ('regression', 0.458), ('error', 0.301), ('classification', 0.292), ('regressor', 0.208), ('binary', 0.165), ('primitives', 0.156), ('primitive', 0.156), ('transistors', 0.146), ('regret', 0.128), ('probing', 0.114), ('loss', 0.104), ('classifier', 0.09), ('single', 0.086), ('convex', 0.082), ('convergence', 0.082), ('goes', 0.081), ('minimize', 0.081), ('call', 0.078), ('estimate', 0.077), ('questioned', 0.065), ('concentrating', 0.065), ('representationally', 0.065), ('reduction', 0.062), ('obvious', 0.062), ('expected', 0.061), ('preferring', 0.06), ('doubts', 0.06), ('translated', 0.06), ('becaue', 0.06), ('enforced', 0.06), ('entertain', 0.06), ('enjoy', 0.06), ('toward', 0.057), ('grounds', 0.054), ('choice', 0.052), ('threshold', 0.052), ('pr', 0.052), ('candidate', 0.052), ('reasons', 0.052), ('judgement', 0.05), ('natural', 0.049), ('rate', 0.049), ('implies', 0.049), ('precision', 0.049), ('underlying', 0.049), ('updated', 0.049), ('evaluating', 0.049), ('varying', 0.047), ('reduced', 0.046)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive


2 0.43613967 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

3 0.31093696 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

Introduction: This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example of this is error correcting output codes, where it is possible to prove that for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)^{0.5}. The use of “regret” is more desi

4 0.26199201 181 hunch net-2006-05-23-What is the best regret transform reduction from multiclass to binary?

Introduction: This post is about an open problem in learning reductions. Background A reduction might transform a multiclass prediction problem where there are k possible labels into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e - b. The PECOC (Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r^{0.5}. The problem This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction, which suggests that a better reduction may exist for general k. For example, we cannot rule out the possibility that a reduction

5 0.25840333 14 hunch net-2005-02-07-The State of the Reduction

Introduction: What? Reductions are machines which turn solvers for one problem into solvers for another problem. Why? Reductions are useful for several reasons. Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve a problem. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work. Crystallization. The problems we often want to solve in learning are worst-case-impossible, but average case feasible. By reducing all problems onto one or a few primitives, we can fine tune these primitives to perform well on real-world problems with greater precision due to the greater number of problems to validate on. Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder. What we know now. Typesafe r

6 0.2258358 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

7 0.20601578 245 hunch net-2007-05-12-Loss Function Semantics

8 0.1866082 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

9 0.17950208 9 hunch net-2005-02-01-Watchword: Loss

10 0.15966251 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009

11 0.15707658 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

12 0.15336205 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary

13 0.13653855 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

14 0.13128421 371 hunch net-2009-09-21-Netflix finishes (and starts)

15 0.1283052 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

16 0.12566233 27 hunch net-2005-02-23-Problem: Reinforcement Learning with Classification

17 0.12396877 259 hunch net-2007-08-19-Choice of Metrics

18 0.12320711 78 hunch net-2005-06-06-Exact Online Learning for Classification

19 0.11420114 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

20 0.11176896 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.192), (1, 0.218), (2, 0.16), (3, -0.198), (4, -0.221), (5, 0.03), (6, 0.143), (7, -0.065), (8, -0.196), (9, -0.043), (10, -0.091), (11, -0.055), (12, -0.033), (13, -0.061), (14, -0.031), (15, -0.004), (16, -0.059), (17, 0.044), (18, 0.031), (19, -0.023), (20, 0.04), (21, 0.047), (22, 0.068), (23, 0.092), (24, 0.017), (25, -0.057), (26, 0.028), (27, -0.018), (28, -0.087), (29, -0.036), (30, -0.016), (31, -0.027), (32, -0.044), (33, -0.05), (34, 0.094), (35, 0.079), (36, 0.072), (37, -0.063), (38, 0.069), (39, -0.033), (40, 0.036), (41, -0.07), (42, 0.034), (43, 0.024), (44, -0.075), (45, 0.053), (46, 0.025), (47, -0.054), (48, 0.05), (49, -0.089)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99012727 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive


2 0.77891493 274 hunch net-2007-11-28-Computational Consequences of Classification


3 0.76467925 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions


4 0.72240275 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary

Introduction: This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and I. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim

5 0.7031092 181 hunch net-2006-05-23-What is the best regret transform reduction from multiclass to binary?


6 0.66421187 14 hunch net-2005-02-07-The State of the Reduction

7 0.64099199 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

8 0.62810367 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

9 0.6252147 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

10 0.53865051 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

11 0.5380795 67 hunch net-2005-05-06-Don’t mix the solution into the problem

12 0.53806931 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

13 0.53103554 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

14 0.5295856 218 hunch net-2006-11-20-Context and the calculation misperception

15 0.50081831 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

16 0.50017565 259 hunch net-2007-08-19-Choice of Metrics

17 0.4835209 245 hunch net-2007-05-12-Loss Function Semantics

18 0.47850502 9 hunch net-2005-02-01-Watchword: Loss

19 0.47611755 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

20 0.47434583 78 hunch net-2005-06-06-Exact Online Learning for Classification


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.137), (3, 0.028), (27, 0.399), (38, 0.082), (48, 0.03), (51, 0.011), (53, 0.013), (55, 0.086), (94, 0.055), (95, 0.044)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96676767 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive


2 0.94904238 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) <= L_Q(S)

3 0.93665797 76 hunch net-2005-05-29-Bad ideas

Introduction: I found these two essays on bad ideas interesting. Neither of these is written from the viewpoint of research, but they are both highly relevant. Why smart people have bad ideas by Paul Graham Why smart people defend bad ideas by Scott Berkun (which appeared on slashdot) In my experience, bad ideas are common and overconfidence in ideas is common. This overconfidence can take either the form of excessive condemnation or excessive praise. Some of this is necessary to the process of research. For example, some overconfidence in the value of your own research is expected and probably necessary to motivate your own investigation. Since research is a rather risky business, much of it does not pan out. Learning to accept when something does not pan out is a critical skill which is sometimes never acquired. Excessive condemnation can be a real ill when it’s encountered. This has two effects: When the penalty for being wrong is too large, it means people have a

4 0.93651062 304 hunch net-2008-06-27-Reviewing Horror Stories

Introduction: Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections do not come to seem acceptable, and these form the basis of reviewing horror stories, great material for conversations. I’ve decided to share three of mine, now all safely a bit distant in the past. Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject which appeared quite unjust to me—for example one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. App

5 0.93399864 352 hunch net-2009-05-06-Machine Learning to AI

Introduction: I recently had fun discussions with both Vikash Mansinghka and Thomas Breuel about approaching AI with machine learning. The general interest in taking a crack at AI with machine learning seems to be rising on many fronts including DARPA . As a matter of history, there was a great deal of interest in AI which died down before I began research. There remain many projects and conferences spawned in this earlier AI wave, as well as a good bit of experience about what did not work, or at least did not work yet. Here are a few examples of failure modes that people seem to run into: Supply/Product confusion . Sometimes we think “Intelligences use X, so I’ll create X and have an Intelligence.” An example of this is the Cyc Project which inspires some people as “intelligences use ontologies, so I’ll create an ontology and a system using it to have an Intelligence.” The flaw here is that Intelligences create ontologies, which they use, and without the ability to create ont

6 0.93160397 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

7 0.92288077 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

8 0.92167705 45 hunch net-2005-03-22-Active learning

9 0.92028046 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

10 0.91996557 259 hunch net-2007-08-19-Choice of Metrics

11 0.91748583 14 hunch net-2005-02-07-The State of the Reduction

12 0.9170723 288 hunch net-2008-02-10-Complexity Illness

13 0.91674495 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

14 0.91669005 378 hunch net-2009-11-15-The Other Online Learning

15 0.9159438 172 hunch net-2006-04-14-JMLR is a success

16 0.91589004 220 hunch net-2006-11-27-Continuizing Solutions

17 0.91499364 293 hunch net-2008-03-23-Interactive Machine Learning

18 0.91397899 308 hunch net-2008-07-06-To Dual or Not

19 0.91391259 360 hunch net-2009-06-15-In Active Learning, the question changes

20 0.9136377 400 hunch net-2010-06-13-The Good News on Exploration and Learning