hunch_net hunch_net-2008 hunch_net-2008-327 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Dean Foster and Daniel Hsu had a couple of observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11). Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs. the other classes, yet they are not comfortable with more complex error correcting codes because they fear that such codes create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes. In particular, if there exists a set of weight vectors w_i such that P(i|x) = <w_i, x>, then for any invertible error correcting output code C, there exist weight vectors w_c which decode to perfectly predict the probability of each class. The proof is simple and constructive: the weight vector w_c can be constructed according to the linear superposition of the w_i implied by the code, and invertibility implies that a correct encoding implies a correct decoding. This observation extends to all-pairs-like codes, which compare subsets of choices to subsets of choices using “don’t cares”. Using this observation, Daniel created a very short proof of the PECOC regret transform theorem (here, and Daniel’s updated version).
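To make the constructive step concrete, here is a minimal numerical sketch (my own illustration rather than code from the post; the variable names and the particular random code matrix are assumptions). It takes per-class scores that are exactly linear in x, forms each code weight vector w_c as the superposition of the w_i given by an invertible code matrix C, and checks that decoding with C^{-1} recovers the one-against-all values exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 4, 6, 5                          # classes, feature dimension, test points

# Assumption of the claim: the per-class probabilities are exactly linear,
# P(i|x) = <w_i, x>, with one weight vector w_i per class (one-against-all).
# We do not bother normalizing; only linearity matters for the decoding argument.
W = rng.standard_normal((k, d))            # rows are the w_i
X = rng.standard_normal((n, d))
P = X @ W.T                                # P[j, i] = <w_i, x_j>

# Any invertible k x k code matrix C. Row c of C defines one code prediction as the
# superposition sum_i C[c, i] * P(i|x), so its weight vector is w_c = sum_i C[c, i] * w_i.
C = rng.standard_normal((k, k)) + 3.0 * np.eye(k)   # generic, well conditioned, hence invertible
W_code = C @ W

# Predict the encoded targets, then decode with C^{-1} to recover the per-class values.
code_predictions = X @ W_code.T            # shape (n, k)
decoded = code_predictions @ np.linalg.inv(C).T

print(np.allclose(decoded, P))             # True: decoding perfectly recovers P(i|x)
```

Reading the last two lines in reverse is the proof itself: W_code = C W gives X W_code^T (C^T)^{-1} = X W^T, so a correct encoding decodes correctly whenever C is invertible.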
One further observation is that under ridge regression (a special case of linear regression), for any code there exists a setting of the parameters such that you might as well use one-against-all instead, because you get the same answer numerically. The implication is that the advantages of codes more complex than one-against-all are confined to other prediction methods.
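How this can happen is easy to see in a minimal sketch, under my reading of the claim (illustrative names, not code from the post): the closed-form ridge solution is linear in its targets, so if the encoded targets are an invertible linear transform C of the one-against-all targets and every output shares the same regularization parameter, the fitted code weights are just the one-against-all weights composed with C, and decoding returns the same predictions up to floating point.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n, lam = 4, 6, 200, 0.5              # classes, features, examples, ridge parameter

X = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)
Y = np.eye(k)[labels]                      # one-against-all (one-hot) regression targets

C = rng.standard_normal((k, k)) + 3.0 * np.eye(k)   # a generic invertible code matrix
Y_code = Y @ C.T                           # encoded targets: each target row y becomes C y

def ridge(X, Y, lam):
    """Closed-form ridge regression with the same penalty on every output column."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # shape (d, #outputs)

W_ova = ridge(X, Y, lam)                   # one-against-all weights
W_code = ridge(X, Y_code, lam)             # weights learned against the encoded targets

# Decoding the code predictions with C^{-1} reproduces the one-against-all predictions.
pred_ova = X @ W_ova
pred_decoded = (X @ W_code) @ np.linalg.inv(C).T

print(np.allclose(pred_decoded, pred_ova))   # True: numerically the same answer
```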
simIndex simValue blogId blogTitle
same-blog 1 1.0 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression
2 0.1948086 181 hunch net-2006-05-23-What is the best regret transform reduction from multiclass to binary?
Introduction: This post is about an open problem in learning reductions. Background: A reduction might transform a multiclass prediction problem where there are k possible labels into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e - b. The PECOC (Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r^0.5. The problem: This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction, which suggests that a better reduction may exist for general k. For example, we cannot rule out the possibility that a reduction …
3 0.19299772 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
Introduction: This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example is error correcting output codes, where it is possible to prove that for certain codes the multiclass error rate is at most 4 * the binary classifier error rate. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)^0.5. The use of “regret” is more desi…
4 0.1866082 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
Introduction: For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered. Binary classification is learning a classifier c: X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y~D}(c(x) ≠ y). The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s: X -> [0,1] so as to minimize squared error, E_{x,y~D}(s(x)-y)^2. It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction a squared error regret of r…
5 0.1681325 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are v…
6 0.16583277 14 hunch net-2005-02-07-The State of the Reduction
7 0.11779842 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
8 0.097999342 308 hunch net-2008-07-06-To Dual or Not
9 0.096445404 361 hunch net-2009-06-24-Interesting papers at UAICMOLT 2009
10 0.095560551 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
11 0.093260899 388 hunch net-2010-01-24-Specializations of the Master Problem
12 0.092886552 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
13 0.092050113 351 hunch net-2009-05-02-Wielding a New Abstraction
14 0.085797518 274 hunch net-2007-11-28-Computational Consequences of Classification
15 0.084947459 140 hunch net-2005-12-14-More NIPS Papers II
16 0.083582014 444 hunch net-2011-09-07-KDD and MUCMD 2011
17 0.082857318 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
18 0.07964325 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility
19 0.079471014 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project
20 0.078110583 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms