hunch_net hunch_net-2010 hunch_net-2010-391 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be. The Problem: The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.
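To make the two interfaces concrete, here is a minimal sketch in Python. The class name, method names, and the running-mean learner standing in for B are all illustrative assumptions, not part of the problem statement; any real regressor could sit behind the same two calls.

```python
from collections import defaultdict

class BinaryRegressionOracle:
    """Illustrative stand-in for the oracle B: one call supplies training
    information (x', y') with y' in [0,1], the other returns a prediction in [0,1]."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def train(self, x_prime, y_prime):
        # B(x', y'): record the training pair; nothing is returned.
        key = tuple(x_prime)
        self.sums[key] += y_prime
        self.counts[key] += 1

    def predict(self, x_prime):
        # B(x'): return a value in [0,1]; here, the running mean of observed labels.
        key = tuple(x_prime)
        if self.counts[key] == 0:
            return 0.5  # uninformed default
        return self.sums[key] / self.counts[key]
```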
sentIndex sentText sentNum sentScore
1 In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. [sent-5, score-0.803]
2 A learning reduction consists of two algorithms R and R^-1 which transform examples from the original input problem into examples for the oracle and then transform the oracle’s predictions into a prediction for the original problem (a minimal sketch of this interface follows the list). [sent-8, score-1.377]
3 R then specifies a training example (x’,y’) for the oracle B . [sent-10, score-0.646]
4 A basic observation is that for any oracle algorithm, a distribution D(x,y) over multiclass examples and a reduction R induce a distribution over a sequence (x’,y’)* of oracle examples. [sent-13, score-1.664]
5 We collapse this into a distribution D’(x’,y’) over oracle examples by drawing uniformly from the sequence. [sent-14, score-0.834]
6 We measure the power of an oracle and a reduction according to squared-loss regret. [sent-16, score-0.719]
7 Alternatively, this open problem is satisfied by proving that there exist no deterministic algorithms R, R^-1 satisfying the above theorem statement. [sent-19, score-0.452]
8 Typically, conditional probability estimation is done in situations where the conditional probability of only one bit is required; however, there are a growing number of applications where a well-estimated conditional probability over a more complex object is required. [sent-22, score-0.796]
9 The motivation for using the learning reduction framework to specify this problem is a combination of generality and the empirical effectiveness that learning reductions have shown in applications. [sent-27, score-0.494]
10 Any solution to this will be general because any oracle B can be plugged in, even ones which use many strange kinds of prior information, features, and active multitask hierarchical (insert your favorite adjective here) structure. [sent-28, score-0.626]
11 For multiclass classification in a partial label setting, no learning reduction can provide a constant regret guarantee . [sent-34, score-0.464]
12 On the other hand, because R calls the oracle at least once, there is a defined induced distribution D’ . [sent-39, score-0.64]
13 Since the theorem must hold for all D and B, it must hold for a D on which your specified learning algorithm fails and for a B for which reg(D’,B)=0, implying the theorem is not satisfied. [sent-40, score-0.879]
14 Feed random examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant. [sent-41, score-0.619]
15 In particular, if the oracle is given examples of the form (x’,y’) where y’ is uniformly at random either 0 or 1, any oracle specifying B(x’)=0.5 has zero squared-loss regret (checked in the short example after this list). [sent-43, score-1.44]
16 Feed pseudorandom examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant. [sent-45, score-0.663]
17 This doesn’t work, because the quantification is “for all binary oracles B ”, and there exists one which, knowing the pseudorandom seed, can achieve zero loss (and hence zero regret). [sent-46, score-0.72]
18 The oracle here is not limited in this fashion since it could completely err for a small fraction of invocations. [sent-49, score-0.561]
19 Employing this approach is not straightforward, because the average in D’ is over an increased number of oracle examples. [sent-51, score-0.658]
20 Hence, at a fixed expected (over oracle examples) regret, the number of examples allowed to have a large regret is increased. [sent-52, score-0.821]
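Sentences 2–5 describe the reduction pair (R, R^-1) and the induced distribution D’. The plumbing can be sketched as follows; the one-binary-example-per-class mapping used here is purely illustrative (it is not a proposed solution to the open problem), and the oracle is any object with a predict method, such as the one sketched earlier.

```python
import random

def R(x, y, num_classes):
    """Map one multiclass example (x, y) into a sequence of oracle examples.
    Illustrative choice: one binary example per class, labeled 1 for the true class."""
    return [((tuple(x), c), 1.0 if c == y else 0.0) for c in range(num_classes)]

def R_inverse(oracle, x, num_classes):
    """Map the oracle's predictions back into an estimate of P(y|x).
    Illustrative choice: normalize the per-class scores."""
    scores = [oracle.predict((tuple(x), c)) for c in range(num_classes)]
    total = sum(scores)
    return [s / total if total > 0 else 1.0 / num_classes for s in scores]

def draw_from_D_prime(draw_from_D, num_classes):
    """One draw from the induced distribution D'(x', y'): draw (x, y) from D,
    apply R, then pick uniformly from the resulting sequence of oracle examples."""
    x, y = draw_from_D()
    return random.choice(R(x, y, num_classes))
```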
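Sentence 6 measures everything in squared-loss regret. Written out (my notation; the original post may phrase it slightly differently), the regret of the oracle B on the induced distribution D’ is its expected squared loss minus the best achievable squared loss, whose minimizer is the conditional mean:

```latex
\mathrm{reg}(D', B)
  = \mathbb{E}_{(x',y') \sim D'}\big[(B(x') - y')^2\big]
    - \min_{f}\, \mathbb{E}_{(x',y') \sim D'}\big[(f(x') - y')^2\big],
\qquad
f^*(x') = \mathbb{E}_{D'}[\, y' \mid x' \,].
```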
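Sentence 15's vacuous-satisfaction trick rests on a small calculation: when y’ is 0 or 1 with equal probability and independent of x’, the squared loss of a constant prediction p is 0.5 p^2 + 0.5 (1-p)^2, which is minimized at p = 0.5, so an oracle that always answers 0.5 has zero regret on such a stream. A tiny check:

```python
# Expected squared loss of a constant prediction p when y' is 0 or 1 with probability 1/2 each.
def expected_sq_loss(p):
    return 0.5 * (p - 0.0) ** 2 + 0.5 * (p - 1.0) ** 2

losses = {p: expected_sq_loss(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}
print(losses)  # {0.0: 0.5, 0.25: 0.3125, 0.5: 0.25, 0.75: 0.3125, 1.0: 0.5}
assert min(losses, key=losses.get) == 0.5  # 0.5 attains the minimum, hence zero regret
```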
wordName wordTfidf (topN-words)
[('oracle', 0.561), ('reg', 0.234), ('theorem', 0.18), ('reduction', 0.158), ('regret', 0.138), ('conditional', 0.132), ('examples', 0.122), ('zero', 0.121), ('hence', 0.113), ('pseudorandom', 0.106), ('vacuously', 0.106), ('multiclass', 0.104), ('probability', 0.101), ('specified', 0.1), ('satisfying', 0.1), ('increased', 0.097), ('estimation', 0.097), ('binary', 0.095), ('problem', 0.093), ('motivation', 0.091), ('feed', 0.087), ('offering', 0.087), ('log', 0.086), ('training', 0.085), ('specify', 0.083), ('quantification', 0.082), ('oracles', 0.082), ('open', 0.079), ('distribution', 0.079), ('satisfy', 0.075), ('hand', 0.074), ('input', 0.073), ('uniformly', 0.072), ('tricks', 0.072), ('using', 0.069), ('transform', 0.068), ('correcting', 0.066), ('structure', 0.065), ('favorite', 0.065), ('robustness', 0.065), ('provide', 0.064), ('algorithm', 0.063), ('random', 0.062), ('specifying', 0.062), ('hold', 0.061), ('reward', 0.058), ('testing', 0.058), ('original', 0.056), ('doesn', 0.056), ('boosting', 0.055)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999958 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be. The Problem: The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.
2 0.26508594 388 hunch net-2010-01-24-Specializations of the Master Problem
Introduction: One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as: The world announces an observation x . The agent makes a choice a . The world announces a reward r . The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of
3 0.25128528 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary
Introduction: This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and I. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem. Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim
4 0.24037376 181 hunch net-2006-05-23-What is the best regret transform reduction from multiclass to binary?
Introduction: This post is about an open problem in learning reductions. Background: A reduction might transform a multiclass prediction problem where there are k possible labels into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e – b. The PECOC (Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r^0.5. The problem: This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction, which suggests that a better reduction may exist for general k. For example, we can not rule out the possibility that a reduction
5 0.20180236 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
Introduction: A type of prediction problem is specified by the type of samples produced by a data source (Example: X x {0,1} , X x [0,1] , X x {1,2,3,4,5} , etc…) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc…). For simplicity, we’ll assume that all losses have a minimum of zero. For this post, we can think of a learning reduction as: A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification). A mapping Q from predictors for type T’ to predictors for type T. The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form: Theorem: For all base predictors b, for all distributions D over examples of type T: E_{(x,y) ~ D} L_T(y, Q(b,x)) <= f(E_{(x’,y’) ~ R(D)} L_{T’}(y’, b(x’))) Here L_T is the loss for the type T problem and L_{T’} is the loss for the type T’ problem. Also, R(D) is the distribution ov
6 0.17966212 14 hunch net-2005-02-07-The State of the Reduction
7 0.17659192 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
8 0.17237541 133 hunch net-2005-11-28-A question of quantification
9 0.14277193 351 hunch net-2009-05-02-Wielding a New Abstraction
10 0.13955849 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
11 0.13653855 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
12 0.13466936 235 hunch net-2007-03-03-All Models of Learning have Flaws
13 0.12904753 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions
14 0.12585507 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
15 0.12445451 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
16 0.12198175 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?
17 0.12189317 245 hunch net-2007-05-12-Loss Function Semantics
18 0.11733039 183 hunch net-2006-06-14-Explorations of Exploration
19 0.1164578 164 hunch net-2006-03-17-Multitask learning is Black-Boxable
20 0.11614706 343 hunch net-2009-02-18-Decision by Vetocracy
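Entry 3 in the list above (Conditional Tournaments for Multiclass to Binary) predicts by running a single-elimination tournament among the classes. A minimal sketch of the test-time step, assuming a learned pairwise classifier beats(x, a, b) that returns the winning class; the function name and the naive bracket pairing are illustrative, not the exact construction from that post.

```python
def tournament_predict(x, classes, beats):
    """Single-elimination tournament over the classes: pair up the survivors,
    keep each match's winner, and repeat until one class remains (about log2 k rounds)."""
    survivors = list(classes)
    while len(survivors) > 1:
        next_round = []
        for i in range(0, len(survivors) - 1, 2):
            next_round.append(beats(x, survivors[i], survivors[i + 1]))
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])  # odd class out gets a bye this round
        survivors = next_round
    return survivors[0]

# Toy usage: a "classifier" that always prefers the larger label picks class 4.
print(tournament_predict(None, [0, 1, 2, 3, 4], lambda x, a, b: max(a, b)))
```

The excerpt's guarantee says the induced multiclass regret of such a predictor is at most log_2 k times the binary regret of the learned pairwise classifier.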
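Entry 4 above (What is the best regret transform reduction from multiclass to binary?) contrasts the PECOC guarantee with the identity reduction at k = 2. In symbols (notation mine, following the excerpt):

```latex
\text{PECOC (any } k\text{):}\quad \mathrm{reg}_{\text{multiclass}} \;\le\; 4\,\mathrm{reg}_{\text{binary}}^{1/2},
\qquad
\text{identity (} k = 2\text{):}\quad \mathrm{reg}_{\text{multiclass}} \;\le\; \mathrm{reg}_{\text{binary}}.
```

For regrets in [0,1] the square-root bound is the weaker of the two, which is what motivates the question of whether a reduction with better dependence exists for general k.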
topicId topicWeight
[(0, 0.265), (1, 0.217), (2, 0.096), (3, -0.102), (4, -0.028), (5, -0.053), (6, 0.175), (7, -0.043), (8, -0.101), (9, 0.026), (10, -0.035), (11, -0.053), (12, -0.002), (13, 0.021), (14, -0.051), (15, 0.045), (16, -0.006), (17, -0.034), (18, 0.066), (19, 0.025), (20, -0.106), (21, -0.026), (22, 0.148), (23, -0.029), (24, -0.014), (25, 0.011), (26, -0.048), (27, -0.017), (28, 0.029), (29, -0.031), (30, -0.013), (31, -0.013), (32, -0.003), (33, 0.028), (34, 0.037), (35, -0.071), (36, -0.07), (37, 0.057), (38, -0.009), (39, 0.026), (40, 0.013), (41, 0.015), (42, 0.007), (43, 0.012), (44, -0.032), (45, -0.074), (46, 0.005), (47, 0.048), (48, -0.009), (49, -0.017)]
simIndex simValue blogId blogTitle
same-blog 1 0.95849723 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be. The Problem: The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.
2 0.77325988 181 hunch net-2006-05-23-What is the best regret transform reduction from multiclass to binary?
Introduction: This post is about an open problem in learning reductions. Background: A reduction might transform a multiclass prediction problem where there are k possible labels into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e – b. The PECOC (Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r^0.5. The problem: This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction, which suggests that a better reduction may exist for general k. For example, we can not rule out the possibility that a reduction
3 0.76189107 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
Introduction: This title is a lie, but it is a special lie which has a bit of truth. If n players each play each other, you have a tournament. How do you order the players from weakest to strongest? The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better). One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith , Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is
4 0.76080525 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary
Introduction: This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and I. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem. Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim
5 0.73938894 14 hunch net-2005-02-07-The State of the Reduction
Introduction: What? Reductions are machines which turn solvers for one problem into solvers for another problem. Why? Reductions are useful for several reasons. Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve a problem. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work. Crystallization. The problems we often want to solve in learning are worst-case-impossible, but average case feasible. By reducing all problems onto one or a few primitives, we can fine tune these primitives to perform well on real-world problems with greater precision due to the greater number of problems to validate on. Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder. What we know now. Typesafe r
6 0.68872803 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
7 0.6591984 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
8 0.63971061 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression
9 0.63135165 388 hunch net-2010-01-24-Specializations of the Master Problem
10 0.62738854 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
11 0.60401434 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
12 0.58691144 351 hunch net-2009-05-02-Wielding a New Abstraction
13 0.58510762 133 hunch net-2005-11-28-A question of quantification
14 0.58506137 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
15 0.56112379 269 hunch net-2007-10-24-Contextual Bandits
16 0.56004602 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?
17 0.5512616 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
18 0.54016912 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions
19 0.53440398 78 hunch net-2005-06-06-Exact Online Learning for Classification
20 0.52999794 43 hunch net-2005-03-18-Binomial Weighting
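Entry 3 of this list (How to solve an NP hard problem in quadratic time) discusses ordering tournament players, ending with the approach of ordering players by their number of wins. A minimal sketch of that ordering step, assuming the tournament is given as (winner, loser) pairs; the representation and function name are illustrative.

```python
from collections import Counter

def order_by_wins(players, matches):
    """Order players from strongest to weakest by raw win count.
    `matches` is an iterable of (winner, loser) pairs, one per pair of players."""
    wins = Counter(winner for winner, _ in matches)
    return sorted(players, key=lambda p: wins[p], reverse=True)

# Toy round robin: A beats B, B beats C, A beats C.
print(order_by_wins(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")]))  # ['A', 'B', 'C']
```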
topicId topicWeight
[(0, 0.016), (3, 0.282), (10, 0.018), (27, 0.209), (38, 0.12), (48, 0.019), (53, 0.046), (55, 0.045), (64, 0.023), (77, 0.037), (94, 0.066), (95, 0.02)]
simIndex simValue blogId blogTitle
1 0.96273232 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary
Introduction: This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and I. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem. Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim
2 0.95188648 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr
same-blog 3 0.93729359 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be. The Problem: The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.
4 0.91401762 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.
Introduction: Classical confidence intervals satisfy a theorem of the form: For some data sources D, Pr_{S ~ D}(f(D) > g(S)) > 1-d, where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. There are a number of results in theoretical machine learn
5 0.87783766 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
Introduction: This, again, is something of a research direction rather than a single problem. There are several metrics people care about which depend upon the relative ranking of examples, and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these. Problem: What does the ability to classify well imply about performance under these metrics? Past Work: Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC. Sample complexity bounds for AROC (and here). A paper on “Learning to Order Things”. Difficulty: Several of these may be easy. Some of them may be h
6 0.85330433 401 hunch net-2010-06-20-2010 ICML discussion site
7 0.85301685 289 hunch net-2008-02-17-The Meaning of Confidence
8 0.79340386 183 hunch net-2006-06-14-Explorations of Exploration
9 0.73700315 484 hunch net-2013-06-16-Representative Reviewing
10 0.73528314 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
11 0.73320335 34 hunch net-2005-03-02-Prior, “Prior” and Bias
12 0.73023987 259 hunch net-2007-08-19-Choice of Metrics
13 0.72457409 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
14 0.72329783 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
15 0.71627593 78 hunch net-2005-06-06-Exact Online Learning for Classification
16 0.69428247 129 hunch net-2005-11-07-Prediction Competitions
17 0.69372159 41 hunch net-2005-03-15-The State of Tight Bounds
18 0.69275367 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”
19 0.69214427 19 hunch net-2005-02-14-Clever Methods of Overfitting
20 0.68790948 461 hunch net-2012-04-09-ICML author feedback is open
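Entry 2 of this last list (Eliminating the Birthday Paradox for Universal Features) describes the core trick of hashing each token straight into a fixed-size index space. A minimal sketch in Python; zlib.crc32 and the table size 2^18 are illustrative stand-ins, not Vowpal Wabbit's actual hash function or default size.

```python
import zlib

def hashed_features(tokens, n=2 ** 18):
    """Map raw tokens directly to indices in a table of size n via a hash function.
    No dictionary is built or stored; occasional collisions lose a little information."""
    return [zlib.crc32(tok.encode("utf-8")) % n for tok in tokens]

print(hashed_features(["the", "quick", "brown", "fox"]))
```

The birthday-paradox concern in the excerpt: with m distinct tokens hashed into n slots there are about m^2/2 pairs, so collisions become likely once m is on the order of n^0.5.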
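Entry 4 (Incompatibilities between classical confidence intervals and learning) writes the guarantee as Pr_{S ~ D}(f(D) > g(S)) > 1-d. One standard instance, for f(D) the mean of a [0,1]-valued quantity and g(S) a Hoeffding-style lower bound from an IID sample of size m (a textbook form, not taken from that post):

```latex
\Pr_{S \sim D^m}\!\left(\mu \;\ge\; \hat{\mu}(S) - \sqrt{\tfrac{\ln(1/\delta)}{2m}}\right) \;\ge\; 1-\delta,
\qquad \hat{\mu}(S) = \tfrac{1}{m}\sum_{i=1}^{m} y_i .
```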
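Entry 5 (Problem: Reductions and Relative Ranking Metrics) lists metrics that depend on the relative ranking of examples. As a reminder of what one of them computes, here is the standard pairwise definition of AROC (area under the ROC curve); the function is a generic sketch, not code from that post.

```python
def aroc(scores, labels):
    """AROC: the probability that a random positive example is scored above a
    random negative example, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return total / (len(pos) * len(neg))

print(aroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```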