hunch_net-2005-31: Problem: Reductions and Relative Ranking Metrics (knowledge-graph by maker-knowledge-mining)
Source: html
Introduction: This, again, is something of a research direction rather than a single problem. There are several metrics people care about which depend upon the relative ranking of examples, and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top-ranked element is in some class, the proportion of the top 10 examples in some class (Google’s problem), the lowest-ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.

Problem: What does the ability to classify well imply about performance under these metrics?

Past Work: Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC. Sample complexity bounds for AROC (and here). A paper on “Learning to Order Things”.

Difficulty: Several of these may be easy. Some of them may be hard.

Impact: Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.
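All of the metrics above are functions of the ranking induced by a predictor's real-valued scores. As a concrete reference point, here is a minimal Python sketch (mine, not part of the original post; the scores and labels are made up) computing two of them, AROC and the proportion of the top-k examples in the positive class, directly from scores:

```python
# Minimal sketch of two relative-ranking metrics computed from classifier scores.
# AROC is taken here in its pairwise form: the probability that a random positive
# example is scored above a random negative example (ties count as half).

def aroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_k(scores, labels, k=10):
    """Proportion of the top-k ranked examples that are in the positive class."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])[:k]
    return sum(y for _, y in ranked) / len(ranked)

# Illustrative data only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0]
print(aroc(scores, labels))                 # 0.777...
print(precision_at_k(scores, labels, k=3))  # 0.666...
```

Both quantities depend only on the ordering of the scores rather than on thresholded correctness, which is the gap the problem above asks about.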
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
2 0.16255432 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
Introduction: Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion. The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order. Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance. AROC is the area under the ROC curve. It is a single number summary of performance. The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.” The ROC vs. accuracy discussion is o
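To make the curve-versus-single-number distinction concrete, here is a small sketch (my illustration, not from the discussion summarized above): sweeping a threshold over the scores traces out ROC points, while accuracy is a single number at one fixed threshold. The data is illustrative only.

```python
# ROC curve = one (false positive rate, true positive rate) point per threshold;
# accuracy = a single number at one fixed threshold.

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def accuracy(scores, labels, threshold=0.5):
    return sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels)) / len(labels)

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0]
print(roc_points(scores, labels))  # several points: the curve
print(accuracy(scores, labels))    # one number
```

AROC then compresses the curve back into a single number by integrating it, which is why two algorithms can always be ordered by AROC or by accuracy, but not always by their full ROC curves.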
3 0.12451397 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class
Introduction: Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU . This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous. This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers. For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through th
4 0.12255333 304 hunch net-2008-06-27-Reviewing Horror Stories
Introduction: Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections do not come to seem acceptable, and these form the basis of reviewing horror stories, great material for conversations. I’ve decided to share three of mine, now all safely a bit distant in the past. Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject which appeared quite unjust to me; for example, one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. App
5 0.11176896 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
Introduction: For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered. Binary classification is learning a classifier c: X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y ~ D}(c(x) ≠ y). The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s: X -> [0,1] so as to minimize squared error, E_{x,y ~ D}(s(x) - y)^2. It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions. These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction a squared error regret of r
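The "obvious" reduction in the last sentences is short enough to write down. A minimal sketch under stated assumptions: `train_regressor` stands in for any squared-error regression learner returning a prediction function (the names are hypothetical, not a specific library's API).

```python
# Classification reduced to squared-error regression, as described above:
# regress on {0,1} labels (the regressor then estimates D(y=1|x)) and
# threshold the prediction at 0.5.

def classify_via_regression(train_regressor, xs, ys):
    predict = train_regressor(xs, [float(y) for y in ys])  # estimates P(y=1|x)
    return lambda x: 1 if predict(x) >= 0.5 else 0

# Toy regressor that ignores x and predicts the label mean, just to run the reduction.
def mean_regressor(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean

classifier = classify_via_regression(mean_regressor, [1, 2, 3, 4], [0, 1, 1, 1])
print(classifier(5))  # -> 1, since the estimated P(y=1|x) is 0.75 >= 0.5
```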
6 0.1089903 14 hunch net-2005-02-07-The State of the Reduction
7 0.10565121 6 hunch net-2005-01-27-Learning Complete Problems
8 0.094201066 41 hunch net-2005-03-15-The State of Tight Bounds
9 0.092467137 351 hunch net-2009-05-02-Wielding a New Abstraction
10 0.088345282 98 hunch net-2005-07-27-Not goal metrics
11 0.087954856 12 hunch net-2005-02-03-Learning Theory, by assumption
12 0.087548159 347 hunch net-2009-03-26-Machine Learning is too easy
13 0.081015088 82 hunch net-2005-06-17-Reopening RL->Classification
14 0.079973683 156 hunch net-2006-02-11-Yahoo’s Learning Problems.
15 0.076796979 310 hunch net-2008-07-15-Interesting papers at COLT (and a bit of UAI & workshops)
16 0.076111965 133 hunch net-2005-11-28-A question of quantification
17 0.073395833 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
18 0.072929792 22 hunch net-2005-02-18-What it means to do research.
19 0.07277064 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
20 0.071144894 274 hunch net-2007-11-28-Computational Consequences of Classification
simIndex simValue blogId blogTitle
same-blog 1 0.95046794 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
2 0.79700631 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
3 0.68543732 14 hunch net-2005-02-07-The State of the Reduction
Introduction: What? Reductions are machines which turn solvers for one problem into solvers for another problem. Why? Reductions are useful for several reasons. Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve a problem. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work. Crystallization. The problems we often want to solve in learning are worst-case-impossible, but average-case feasible. By reducing all problems onto one or a few primitives, we can fine-tune these primitives to perform well on real-world problems with greater precision due to the greater number of problems to validate on. Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder. What we know now. Typesafe r
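As one concrete instance of "a machine that turns a solver for one problem into a solver for another," here is a hedged one-against-all sketch (my illustration; the post itself discusses a broader family of reductions). `binary_learner` is assumed to return a real-valued scorer, with larger scores meaning label 1 is more likely; that interface is hypothetical.

```python
# One-against-all: reduce k-class classification to k binary problems.
# Any binary classification learner can be plugged in unchanged, which is
# exactly the "laziness" argument above.

def one_against_all(binary_learner, xs, ys, num_classes):
    scorers = []
    for k in range(num_classes):
        # Pose one binary problem per class: "is the label k?"
        scorers.append(binary_learner(xs, [1 if y == k else 0 for y in ys]))
    # The induced multiclass predictor picks the class whose scorer is most confident.
    return lambda x: max(range(num_classes), key=lambda k: scorers[k](x))
```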
4 0.65170044 27 hunch net-2005-02-23-Problem: Reinforcement Learning with Classification
Introduction: At an intuitive level, the question here is “Can reinforcement learning be solved with classification?” Problem: Construct a reinforcement learning algorithm with near-optimal expected sum of rewards in the direct experience model, given access to a classifier learning algorithm which has a small error rate or regret on all posed classification problems. The definition of “posed” here is slightly murky. I consider a problem “posed” if there is an algorithm for constructing labeled classification examples. Past Work: There exists a reduction of reinforcement learning to classification given a generative model. A generative model is an inherently stronger assumption than the direct experience model. Other work on learning reductions may be important. Several algorithms for solving reinforcement learning in the direct experience model exist. Most, such as E^3, Factored-E^3, metric-E^3, and Rmax, require that the observation be the state. Recent work
5 0.62696296 347 hunch net-2009-03-26-Machine Learning is too easy
Introduction: One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding, the others. There are two fundamental reasons why this is possible. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. There is nothing like a unifying problem defining the field. In many other areas there
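The "does something nontrivial" claim can be made precise with the standard Occam's razor bound (standard learning-theory folklore, not stated in the post): for any finite set of predictors C and m IID examples, with probability at least 1 - delta over the sample,

```latex
% Occam's razor bound (Hoeffding + union bound over a finite predictor set C):
% with probability at least 1 - \delta over m IID examples,
\forall c \in C: \quad
\mathrm{err}_D(c) \;\le\; \widehat{\mathrm{err}}_S(c) + \sqrt{\frac{\ln|C| + \ln(1/\delta)}{2m}}
```

so the empirical error minimizer is close to the best predictor in C once m is large relative to ln|C|, which is the sense in which many different finitely-representable approaches all "work".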
6 0.62319297 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
7 0.614838 67 hunch net-2005-05-06-Don’t mix the solution into the problem
8 0.6103099 138 hunch net-2005-12-09-Some NIPS papers
9 0.60972553 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
10 0.6096276 177 hunch net-2006-05-05-An ICML reject
11 0.6083504 101 hunch net-2005-08-08-Apprenticeship Reinforcement Learning for Control
12 0.60216868 148 hunch net-2006-01-13-Benchmarks for RL
13 0.59634364 133 hunch net-2005-11-28-A question of quantification
14 0.58049905 351 hunch net-2009-05-02-Wielding a New Abstraction
15 0.57379544 104 hunch net-2005-08-22-Do you believe in induction?
16 0.57121336 2 hunch net-2005-01-24-Holy grails of machine learning?
17 0.56032914 131 hunch net-2005-11-16-The Everything Ensemble Edge
18 0.55964589 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
19 0.55894488 42 hunch net-2005-03-17-Going all the Way, Sometimes
20 0.55785179 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
simIndex simValue blogId blogTitle
1 0.96955544 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary
Introduction: This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and me. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2(k). Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c). Here: Filter_tree_train(D) is the induced binary classification problem, Filter_tree_test(c) is the induced multiclass classifier, and reg_multiclass is the multiclass regret (= difference between error rate and minim
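A hedged sketch of the test-time half of such a tournament (my illustration of the general single-elimination idea, not the authors' filter tree code): given a binary predictor for each match, the induced multiclass classifier simply plays out the bracket. `match_predictors[(a, b)]` is a hypothetical per-node binary classifier returning the winning class for input x.

```python
# Single-elimination tournament over the classes at test time.

def tournament_predict(match_predictors, classes, x):
    remaining = list(classes)
    while len(remaining) > 1:
        next_round = []
        # Pair off the survivors; an unpaired class gets a bye.
        for i in range(0, len(remaining) - 1, 2):
            a, b = remaining[i], remaining[i + 1]
            next_round.append(match_predictors[(a, b)](x))
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    return remaining[0]
```

With k classes the bracket has depth about log_2(k), which is where the log_2(k) factor in the regret bound comes from.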
2 0.93632704 401 hunch net-2010-06-20-2010 ICML discussion site
Introduction: A substantial difficulty with the 2009 and 2008 ICML discussion system was a communication vacuum, where authors were not informed of comments, and commenters were not informed of responses to their comments without explicit monitoring. Mark Reid has set up a new discussion system for 2010 with the goal of addressing this. Mark didn’t want to make it too intrusive, so you must opt-in. As an author, find your paper and “Subscribe by email” to the comments. As a commenter, you have the option of providing an email for follow-up notification.
3 0.92825127 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features
Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr
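A minimal sketch of the hashing trick and the birthday-paradox arithmetic above (illustrative only; Vowpal Wabbit's actual hash function and table layout differ):

```python
import hashlib

# Map a token to an index in a table of size n. A stable hash (md5) is used here
# purely so the example is reproducible across runs; any good hash works.
def feature_index(token, n):
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % n

# Birthday-paradox check: with m distinct tokens and table size n, collisions
# show up once m is on the order of n^0.5 (roughly m^2 / (2n) of them in expectation).
n = 2 ** 20
tokens = ["word%d" % i for i in range(2000)]   # m is about 2 * n^0.5
indices = [feature_index(t, n) for t in tokens]
print(len(tokens) - len(set(indices)))          # typically a couple of collisions
```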
same-blog 4 0.90731889 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics
5 0.87851042 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.
Introduction: Classical confidence intervals satisfy a theorem of the form: For some data sources D, Pr_{S ~ D}(f(D) > g(S)) > 1-d, where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between “Independent and identically distributed (IID) samples from a Gaussian with an unknown mean” to “IID samples from an arbitrary distribution D”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. There are a number of results in theoretical machine learn
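A concrete instance of the template above (my example, using a one-sided Hoeffding bound): take f(D) to be the true mean of a [0,1]-valued quantity and g(S) to be the empirical mean of m IID samples minus a deviation term. Then for any d in (0,1),

```latex
% One-sided Hoeffding instance of  Pr_{S ~ D}( f(D) > g(S) ) > 1 - d,
% with S = (x_1, ..., x_m) drawn IID from D and each x_i in [0,1].
\Pr_{S \sim D^m}\!\left( \mathbb{E}_{x \sim D}[x] \;>\; \frac{1}{m}\sum_{i=1}^{m} x_i - \sqrt{\frac{\ln(1/d)}{2m}} \right) \;\ge\; 1 - d
```

The guarantee is over the draw of S, not conditional on the observed sample, which is exactly the "not often wrong in the future" reading described above.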
6 0.83829004 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
7 0.79551071 289 hunch net-2008-02-17-The Meaning of Confidence
8 0.7424711 183 hunch net-2006-06-14-Explorations of Exploration
9 0.70083576 484 hunch net-2013-06-16-Representative Reviewing
10 0.66589165 34 hunch net-2005-03-02-Prior, “Prior” and Bias
11 0.64034474 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
12 0.63552415 67 hunch net-2005-05-06-Don’t mix the solution into the problem
13 0.62923682 129 hunch net-2005-11-07-Prediction Competitions
14 0.62522566 461 hunch net-2012-04-09-ICML author feedback is open
15 0.61854887 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification
16 0.61818266 78 hunch net-2005-06-06-Exact Online Learning for Classification
17 0.60973966 41 hunch net-2005-03-15-The State of Tight Bounds
18 0.60722798 9 hunch net-2005-02-01-Watchword: Loss
19 0.60143882 259 hunch net-2007-08-19-Choice of Metrics
20 0.5989024 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?