A conversation between Theo and Pat (hunch net, 2006-05-01)
Pat (the practitioner): I need to do multiclass classification and I only have a decision tree.
Theo (the theoretician): Use an error correcting output code.
Pat: Oh, that's cool. But the created binary problems seem unintuitive. I'm not sure the decision tree can solve them.
Theo: Oh? Is your problem a decision list?
Pat: No, I don't think so.
Theo: Hmm. Are the classes well separated by axis-aligned splits?
Pat: Err, maybe. I'm not sure.
Theo: Well, if they are, under the IID assumption I can tell you how many samples you need.
Pat: IID? The data is definitely not IID.
Theo: Oh dear.
Pat: Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other.
Theo: Ok. What do you know about your problem?
Pat: Not much. My friend just gave me the dataset.
Theo: Then no one can help you.
Pat: (What a fuzzy thinker. Theo keeps jumping to assumptions that just aren't true.)
Theo: (What a fuzzy thinker. Pat's problem is unsolvable without making extra assumptions.)
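For concreteness, here is a minimal sketch of the reduction Theo proposes: an error correcting output code with a decision tree as the underlying binary learner. This is an illustration only, assuming Python and scikit-learn (neither appears in the post), with a synthetic stand-in for Pat's dataset and arbitrary hyperparameters.

    # Minimal ECOC sketch: each class is assigned a binary codeword, one
    # decision tree is trained per code bit, and prediction picks the class
    # whose codeword is closest to the vector of binary predictions.
    from sklearn.datasets import make_classification
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for Pat's dataset (the post never specifies one).
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=10, n_classes=5,
                               n_clusters_per_class=1, random_state=0)

    # code_size=2.0 gives 2 * n_classes random code bits, i.e. 10 induced
    # binary problems for 5 classes; each is solved by an axis-aligned tree.
    ecoc = OutputCodeClassifier(DecisionTreeClassifier(max_depth=5),
                                code_size=2.0, random_state=0)
    ecoc.fit(X, y)
    print("training accuracy:", ecoc.score(X, y))

Note that the codewords here are random, which is exactly what bothers Pat: the induced binary problems may be unnatural for a decision tree. His dynamic-ECOC suggestion amounts to replacing the random code matrix with one built from label subsets the binary learner empirically separates well.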
I've heard variants of this conversation several times. The fundamental difference in viewpoint is the following: Theo lives in a world where he chooses the problem to solve based upon the learning model (and assumptions) used. Pat lives in a world where the problem is imposed on him. I'd love for these confusions to go away, but there is no magic wand. The best advice seems to be: listen carefully and avoid assuming too much in what you hear.