hunch_net hunch_net-2005 hunch_net-2005-67 knowledge-graph by maker-knowledge-mining

67 hunch net-2005-05-06-Don’t mix the solution into the problem


meta info for this blog

Source: html

Introduction: A common defect of many pieces of research is defining the problem in terms of the solution. Here are some examples in learning: (1) “The learning problem is finding a good separating hyperplane.” (2) “The goal of learning is to minimize (y - p)^2 + C w^2, where y = the observation, p = the prediction, and w = a parameter vector.” (3) Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world. The fundamental reason why this is a defect is that it creates artificial boundaries to problem solution. Artificial boundaries lead to the possibility of being blind-sided. For example, someone committing (1) or (2) above might find themselves surprised to find a decision tree working well on a problem. Example (3) might result in someone else solving a learning problem better for real world purposes, even if it’s worse with respect to the algorithm optimization. This defect should be avoided so as to not artificially limit your learning kungfu.
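The second quoted objective is recognizable as ridge regression once you assume a linear predictor p = w·x (that assumption, and the synthetic data, are mine, not the post's). A minimal numpy sketch of minimizing the quoted objective in closed form:

```python
import numpy as np

# Hypothetical setup: rows of X are inputs, y are observations, and the
# prediction is p = X @ w, so the quoted objective sum (y - p)^2 + C * ||w||^2
# is ridge regression with closed-form minimizer (X'X + C I)^{-1} X'y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)

C = 0.1  # the regularization weight from the quoted objective
w = np.linalg.solve(X.T @ X + C * np.eye(5), X.T @ y)

print(np.allclose(w, w_true, atol=0.1))  # recovers the generating weights
```

Note this sketch illustrates exactly the defect the post warns about: the objective bakes a particular solution family (regularized linear prediction) into what is presented as "the goal of learning."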


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A common defect of many pieces of research is defining the problem in terms of the solution. [sent-1, score-1.018]

2 Here are some examples in learning: “The learning problem is finding a good separating hyperplane. [sent-2, score-0.248]

3 ” “The goal of learning is to minimize (y - p)^2 + C w^2, where y = the observation, p = the prediction, and w = a parameter vector. [sent-3, score-0.224]

4 ” Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world. [sent-4, score-0.413]

5 The fundamental reason why this is a defect is that it creates artificial boundaries to problem solution. [sent-5, score-1.311]

6 Artificial boundaries lead to the possibility of being blind-sided. [sent-6, score-0.438]

7 For example, someone committing (1) or (2) above might find themselves surprised to find a decision tree working well on a problem. [sent-7, score-0.529]

8 Example (3) might result in someone else solving a learning problem better for real world purposes, even if it’s worse with respect to the algorithm optimization. [sent-8, score-0.711]

9 This defect should be avoided so as to not artificially limit your learning kungfu. [sent-9, score-0.934]

10 The way to avoid this defect while still limiting the scope of investigation to something you can manage is to be explicit. [sent-10, score-0.902]

11 Admit what the real world-imposed learning problem is. [sent-11, score-0.357]

12 For example “The problem is to find a classifier minimizing error rate”. [sent-12, score-0.73]

13 Be explicit about where the problem ends and the solution begins. [sent-13, score-0.354]

14 For example “We use a support vector machine with an l_2 loss to train a classifier. [sent-14, score-0.42]

15 We use the l_2 loss because it is an upper bound on the error rate which is computationally tractable to optimize. [sent-15, score-0.645]

16 For example “The error rate on our test set was 0. [sent-17, score-0.405]

17 ” It is important to note that this is not a critique of any particular method for solving learning problems, but rather of the process of thinking about learning problems. [sent-19, score-0.298]

18 Eliminating this thinking-bug will leave people more capable of appreciating and using different approaches to solve the real learning problem. [sent-20, score-0.457]
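Sentences 14-15 above claim the l_2 (squared) loss is a computationally tractable upper bound on the error rate. A hedged numpy sketch of that bound, with the standard encoding of labels in {-1, +1} and synthetic real-valued predictions (both my assumptions, not spelled out in the post):

```python
import numpy as np

# With labels y in {-1, +1} and a real-valued prediction p, the squared loss
# (y - p)^2 equals (1 - y*p)^2, so a misclassification (sign(p) != y, i.e.
# y*p <= 0) forces (y - p)^2 >= 1: squared loss upper bounds 0-1 error
# example by example, hence also on average.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
p = y * rng.normal(loc=0.5, scale=1.0, size=1000)  # noisy predictions

squared_loss = np.mean((y - p) ** 2)      # what the algorithm optimizes
error_rate = np.mean(np.sign(p) != y)     # what the world imposes

assert squared_loss >= error_rate         # the upper bound holds
print(f"squared loss {squared_loss:.3f} >= error rate {error_rate:.3f}")
```

This is the "be explicit" discipline in miniature: the code names both the world-imposed quantity (error rate) and the surrogate actually optimized, and states the relationship between them.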


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('defect', 0.563), ('boundaries', 0.281), ('artificial', 0.208), ('defining', 0.2), ('problem', 0.181), ('loss', 0.149), ('error', 0.145), ('rate', 0.141), ('find', 0.134), ('appreciating', 0.125), ('example', 0.119), ('admit', 0.116), ('artificially', 0.116), ('someone', 0.115), ('real', 0.109), ('avoided', 0.109), ('finish', 0.109), ('optimizes', 0.109), ('ends', 0.1), ('minimizing', 0.093), ('solving', 0.093), ('investigation', 0.091), ('purposes', 0.088), ('eliminating', 0.088), ('leave', 0.086), ('limiting', 0.084), ('train', 0.084), ('imposed', 0.084), ('scope', 0.082), ('manage', 0.082), ('possibility', 0.082), ('surprised', 0.082), ('parameter', 0.079), ('limit', 0.079), ('minimize', 0.078), ('creates', 0.078), ('upper', 0.076), ('lead', 0.075), ('pieces', 0.074), ('else', 0.074), ('explicit', 0.073), ('tractable', 0.072), ('worse', 0.072), ('rather', 0.071), ('capable', 0.07), ('vector', 0.068), ('learning', 0.067), ('tree', 0.064), ('computationally', 0.062), ('classifier', 0.058)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 67 hunch net-2005-05-06-Don’t mix the solution into the problem


2 0.1691391 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l_2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

3 0.16787346 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

Introduction: This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example of this is that error correcting output codes where it is possible to prove that for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)^0.5. The use of “regret” is more desi

4 0.13532998 347 hunch net-2009-03-26-Machine Learning is too easy

Introduction: One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding the others. There are two fundamental reasons why this is possible. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. There is nothing like a unifying problem defining the field. In many other areas there

5 0.13075204 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y != Threshold(f_w(x) - 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
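The excerpt above contrasts a computationally convenient proxy loss with the world-imposed 0-1 loss. A small hedged sketch (the function names and example values are mine, not the post's) evaluating the losses it mentions on a single predicted probability:

```python
import math

# Illustrative loss functions for a label y in {0, 1} and a predicted
# probability p_hat in [0, 1] that the label is 1.
def zero_one_loss(y, p_hat, threshold=0.5):
    # the world-imposed loss: 1 if the thresholded prediction is wrong
    return float((p_hat >= threshold) != (y == 1))

def squared_loss(y, p_hat):
    # a convex, differentiable proxy
    return (y - p_hat) ** 2

def log_loss(y, p_hat):
    # loss = log 1/p'(y|x), as in the excerpt
    return -math.log(p_hat if y == 1 else 1.0 - p_hat)

y, p_hat = 1, 0.9
print(zero_one_loss(y, p_hat),
      round(squared_loss(y, p_hat), 4),
      round(log_loss(y, p_hat), 4))
```

The proxies are smooth and convex in p_hat while the 0-1 loss is a step function, which is the computational advantage the excerpt refers to.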

6 0.13023402 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

7 0.12731585 259 hunch net-2007-08-19-Choice of Metrics

8 0.12473522 274 hunch net-2007-11-28-Computational Consequences of Classification

9 0.12262534 245 hunch net-2007-05-12-Loss Function Semantics

10 0.11936478 78 hunch net-2005-06-06-Exact Online Learning for Classification

11 0.11818635 14 hunch net-2005-02-07-The State of the Reduction

12 0.11240716 370 hunch net-2009-09-18-Necessary and Sufficient Research

13 0.11034749 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

14 0.10850404 358 hunch net-2009-06-01-Multitask Poisoning

15 0.10787205 235 hunch net-2007-03-03-All Models of Learning have Flaws

16 0.10764377 44 hunch net-2005-03-21-Research Styles in Machine Learning

17 0.10541939 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

18 0.10515319 35 hunch net-2005-03-04-The Big O and Constants in Learning

19 0.09966062 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

20 0.097881727 19 hunch net-2005-02-14-Clever Methods of Overfitting


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.21), (1, 0.153), (2, 0.063), (3, -0.044), (4, -0.149), (5, -0.001), (6, 0.067), (7, 0.022), (8, -0.025), (9, 0.005), (10, -0.041), (11, -0.001), (12, 0.034), (13, 0.003), (14, 0.028), (15, 0.023), (16, 0.009), (17, -0.003), (18, -0.06), (19, 0.077), (20, 0.005), (21, 0.068), (22, -0.074), (23, 0.047), (24, 0.005), (25, 0.029), (26, 0.047), (27, 0.001), (28, -0.029), (29, -0.04), (30, -0.053), (31, 0.04), (32, 0.011), (33, 0.007), (34, 0.007), (35, 0.034), (36, 0.124), (37, -0.051), (38, -0.029), (39, 0.022), (40, 0.022), (41, -0.013), (42, 0.006), (43, -0.006), (44, 0.038), (45, 0.003), (46, -0.059), (47, -0.029), (48, 0.021), (49, -0.02)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94705087 67 hunch net-2005-05-06-Don’t mix the solution into the problem


2 0.72869021 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

Introduction: Suppose you are given a sequence of observations x_1, …, x_T from some space and wish to predict a sequence of labels y_1, …, y_T so as to minimize the Hamming loss: sum_{i=1}^{T} I(y_i != c(x_1, …, x_T)_i) where c(x_1, …, x_T)_i is the i-th predicted component. For simplicity, suppose each label y_i is in {0,1}. We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction. A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1} where the circularity (predicting given other
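The excerpt's decomposition claim, that an average per-component error rate e over T components gives Hamming loss Te, can be checked numerically. A hedged numpy sketch with synthetic binary sequences (the data is made up; only the identity being verified comes from the excerpt):

```python
import numpy as np

# Hamming loss over T binary components is the sum of per-component 0-1
# errors, so on any fixed sample the mean Hamming loss per sequence equals
# T times the average per-component error rate e.
rng = np.random.default_rng(0)
T = 8
y = rng.integers(0, 2, size=(100, T))      # true label sequences
pred = rng.integers(0, 2, size=(100, T))   # per-component predictions

hamming = np.mean(np.sum(y != pred, axis=1))  # mean Hamming loss per sequence
e = np.mean(y != pred)                        # average per-component error

assert np.isclose(hamming, T * e)
print(f"mean Hamming loss {hamming:.3f} == T*e = {T * e:.3f}")
```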

3 0.69353074 259 hunch net-2007-08-19-Choice of Metrics

Introduction: How do we judge success in Machine Learning? As Aaron notes , the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real wor

4 0.69223869 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

Introduction: This title is a lie, but it is a special lie which has a bit of truth. If n players each play each other, you have a tournament. How do you order the players from weakest to strongest? The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better). One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith , Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is
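The heuristic the excerpt ends on, ordering players by number of wins, can be sketched in a few lines. This is a hedged illustration of the ordering rule only (with a hypothetical 4-player tournament), not the approximation analysis from the cited paper:

```python
import numpy as np

# Given a round-robin tournament as a boolean matrix with
# beats[i, j] = True iff player i beat player j, order players by win count
# (strongest first), the simple heuristic mentioned in the excerpt.
def order_by_wins(beats: np.ndarray) -> list[int]:
    wins = beats.sum(axis=1)
    # break ties by player index for determinism
    return sorted(range(len(wins)), key=lambda i: (-wins[i], i))

# Hypothetical tournament consistent with the ordering 0 > 1 > 2 > 3,
# except for one upset: player 3 beat player 0.
beats = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=bool)
print(order_by_wins(beats))  # -> [0, 1, 2, 3]
```

Despite the upset, counting wins still recovers the intended ordering here; the point of the excerpt is that this cheap statistic provably tracks the NP-hard minimum feedback arcset objective up to a constant factor.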

5 0.68500239 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate , I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

6 0.67605942 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

7 0.63960999 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

8 0.63560724 14 hunch net-2005-02-07-The State of the Reduction

9 0.63091743 35 hunch net-2005-03-04-The Big O and Constants in Learning

10 0.63009387 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

11 0.62936825 33 hunch net-2005-02-28-Regularization

12 0.62302816 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

13 0.61596113 218 hunch net-2006-11-20-Context and the calculation misperception

14 0.60903221 351 hunch net-2009-05-02-Wielding a New Abstraction

15 0.60006052 78 hunch net-2005-06-06-Exact Online Learning for Classification

16 0.59633452 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

17 0.58734119 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

18 0.57887203 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting

19 0.57869446 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

20 0.5762012 370 hunch net-2009-09-18-Necessary and Sufficient Research


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.082), (9, 0.03), (27, 0.307), (53, 0.061), (55, 0.051), (74, 0.258), (94, 0.055), (95, 0.047)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.91150296 217 hunch net-2006-11-06-Data Linkage Problems

Introduction: Data linkage is a problem which seems to come up in various applied machine learning problems. I have heard it mentioned in various data mining contexts, but it seems relatively less studied for systemic reasons. A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algo

2 0.89941376 404 hunch net-2010-08-20-The Workshop on Cores, Clusters, and Clouds

Introduction: Alekh , John , Ofer , and I are organizing a workshop at NIPS this year on learning in parallel and distributed environments. The general interest level in parallel learning seems to be growing rapidly, so I expect quite a bit of attendance. Please join us if you are parallel-interested. And, if you are working in the area of parallel learning, please consider submitting an abstract due Oct. 17 for presentation at the workshop.

same-blog 3 0.88346791 67 hunch net-2005-05-06-Don’t mix the solution into the problem


4 0.83929938 278 hunch net-2007-12-17-New Machine Learning mailing list

Introduction: IMLS (which is the nonprofit running ICML) has setup a new mailing list for Machine Learning News . The list address is ML-news@googlegroups.com, and signup requires a google account (which you can create). Only members can send messages.

5 0.83593875 370 hunch net-2009-09-18-Necessary and Sufficient Research

Introduction: Researchers are typically confronted with big problems that they have no idea how to solve. In trying to come up with a solution, a natural approach is to decompose the big problem into a set of subproblems whose solution yields a solution to the larger problem. This approach can go wrong in several ways. Decomposition failure. The solution to the decomposition does not in fact yield a solution to the overall problem. Artificial hardness. The subproblems created are sufficient, if solved, to solve the overall problem, but they are harder than necessary. As you can see, computational complexity forms a relatively new (in research-history) razor by which to judge an approach sufficient but not necessary. In my experience, the artificial hardness problem is very common. Many researchers abdicate the responsibility of choosing a problem to work on to other people. This process starts very naturally as a graduate student, when an incoming student might have relatively l

6 0.78453439 183 hunch net-2006-06-14-Explorations of Exploration

7 0.77200919 289 hunch net-2008-02-17-The Meaning of Confidence

8 0.76629287 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

9 0.76430207 9 hunch net-2005-02-01-Watchword: Loss

10 0.76012766 41 hunch net-2005-03-15-The State of Tight Bounds

11 0.76005989 220 hunch net-2006-11-27-Continuizing Solutions

12 0.75804222 258 hunch net-2007-08-12-Exponentiated Gradient

13 0.75763875 304 hunch net-2008-06-27-Reviewing Horror Stories

14 0.7570796 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

15 0.75524056 352 hunch net-2009-05-06-Machine Learning to AI

16 0.75498891 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

17 0.75446534 360 hunch net-2009-06-15-In Active Learning, the question changes

18 0.75439447 45 hunch net-2005-03-22-Active learning

19 0.75408912 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

20 0.75348592 347 hunch net-2009-03-26-Machine Learning is too easy