hunch_net-2005-18 knowledge-graph by maker-knowledge-mining

18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC


meta info for this blog

Source: html

Introduction: Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion. The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order. Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance. AROC is the area under the ROC curve. It is a single-number summary of performance. The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.” The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking?” because ROC curve construction requires a ranking be produced.
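As a hedged illustration of the three quantities above (this sketch is not from the original post; the toy labels, scores, and helper names are made up), the following Python computes accuracy from hard predictions, traces the ROC curve by sweeping a threshold down a ranking, and integrates it to get AROC:

def accuracy(labels, predictions):
    """Accuracy = 1 - error rate, for hard 0/1 predictions."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def roc_points(labels, scores):
    """Sweep a threshold down the ranking induced by `scores` and record
    (false positive rate, true positive rate) after each example.
    Assumes both classes are present."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def aroc(labels, scores):
    """AROC = area under the ROC curve, here via the trapezoid rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0, 0, 1, 0]                   # true classes
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # ranker output
    hard = [1 if s >= 0.5 else 0 for s in scores]       # one fixed threshold
    print("accuracy:", accuracy(labels, hard))          # 0.75
    print("AROC:", aroc(labels, scores))                # 0.6875

The single fixed threshold is what accuracy evaluates; the full sweep over thresholds is what the ROC curve and AROC evaluate.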


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. [sent-1, score-0.204]

2 The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. [sent-4, score-0.764]

3 The ROC curve is a curve and not a single number statistic. [sent-5, score-0.581]

4 In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order. [sent-6, score-0.21]

5 The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. [sent-11, score-0.232]

6 The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking? [sent-14, score-0.572]

7 ” because ROC curve construction requires a ranking be produced. [sent-15, score-0.361]

8 (There are several natural problems where ranking of instances is much preferred to classification. [sent-17, score-0.2]

9 In addition, there are several natural problems where classification is the goal. [sent-18, score-0.194]

10 ) Arguments for ROC. Ill-specification: The costs of choices are not well specified. [sent-19, score-0.169]

11 ROC curves allow for an effective comparison over a range of different choice costs and marginal distributions. [sent-21, score-0.591]

12 Ill-dominance: Standard classification algorithms do not have a dominance structure as the costs vary. [sent-22, score-0.333]

13 We should not say “algorithm A is better than algorithm B” when you don’t know the choice costs well enough to be sure. [sent-23, score-0.228]

14 Just-in-Time use: Any system with a good ROC curve can easily be designed with a ‘knob’ that controls the rate of false positives vs. false negatives. [sent-24, score-0.549] (see the threshold sketch after this list)

15 Arguments for AROC. Summarization: Humans don’t have the time to understand the complexities of a conditional comparison, so having a single number instead of a curve is valuable. [sent-27, score-0.403]

16 Intuitiveness: People understand immediately what accuracy means. [sent-31, score-0.411]

17 Statistical Stability: The basic test set bound shows that accuracy is stable subject to only the IID assumption. [sent-33, score-0.495] (a standard form of this bound is written out after this list)

18 ROC curves become problematic when there are just 3 classes. [sent-39, score-0.203]

19 One way to rephrase this argument is “Lack of knowledge of relative costs means that classifiers should be rankers so false positive to false negative ratios can be easily altered. [sent-41, score-0.671]

20 ” In other words, this is an argument for “ranking instead of classification” rather than “(A)ROC instead of Accuracy”. [sent-42, score-0.197]
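Sentences 14 and 19 both come down to thresholding a ranker. Here is a minimal, hedged Python sketch (not from the original post; the data and the helper name confusion_at_threshold are illustrative) of how turning that one knob trades false positives against false negatives:

def confusion_at_threshold(labels, scores, threshold):
    """Predict positive iff score >= threshold; count the two error types."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return fp, fn

labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

# Turning the knob: a low threshold admits more positives (fewer false
# negatives, more false positives); a high threshold does the opposite.
for t in (0.2, 0.5, 0.75):
    fp, fn = confusion_at_threshold(labels, scores, t)
    print(f"threshold={t:.2f}  false positives={fp}  false negatives={fn}")

A classifier that only outputs hard labels has no such knob, which is the force of the “ranking instead of classification” framing in sentence 20.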
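For sentence 17, a standard Hoeffding-style form of the kind of test set bound it refers to (the exact statement in the original post is not reproduced here) is: for an IID test set of m examples and any δ in (0, 1), with probability at least 1 − δ,

\[
\bigl|\widehat{\mathrm{acc}} - \mathrm{acc}\bigr| \;\le\; \sqrt{\frac{\ln(2/\delta)}{2m}},
\]

where \widehat{\mathrm{acc}} is the accuracy measured on the test set and \mathrm{acc} is the true accuracy. Nothing beyond the IID assumption on the test set is needed, which is the sense in which accuracy is called statistically stable above.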


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('roc', 0.626), ('accuracy', 0.411), ('aroc', 0.244), ('curve', 0.235), ('costs', 0.169), ('curves', 0.165), ('arguments', 0.132), ('false', 0.132), ('explanation', 0.128), ('ranking', 0.126), ('classification', 0.12), ('comparison', 0.12), ('summary', 0.099), ('summarization', 0.098), ('argument', 0.083), ('marginal', 0.078), ('natural', 0.074), ('single', 0.064), ('choice', 0.059), ('instead', 0.057), ('easily', 0.057), ('classified', 0.049), ('generalizes', 0.049), ('positives', 0.049), ('provost', 0.049), ('ratios', 0.049), ('receiver', 0.049), ('rephrase', 0.049), ('number', 0.047), ('obvious', 0.046), ('strongest', 0.045), ('algorithms', 0.044), ('affair', 0.043), ('characteristic', 0.043), ('stable', 0.043), ('test', 0.041), ('alternate', 0.041), ('foster', 0.041), ('measures', 0.041), ('variation', 0.041), ('goal', 0.041), ('controls', 0.039), ('merits', 0.039), ('performance', 0.039), ('standard', 0.038), ('evaluate', 0.038), ('metrics', 0.038), ('problematic', 0.038), ('stability', 0.038), ('rate', 0.037)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC


2 0.16255432 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

Introduction: This, again, is something of a research direction rather than a single problem. There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (Google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these. Problem: What does the ability to classify well imply about performance under these metrics? Past Work: Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC. Sample complexity bounds for AROC (and here). A paper on “Learning to Order Things”. Difficulty: Several of these may be easy. Some of them may be h

3 0.14067999 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana , Alexandru Niculescu , Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression . For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. ( Calibrated ) boosted decision trees appear to perform best, in general although support v

4 0.13030234 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name Method Explanation Remedy Traditional overfitting Train a complex predictor on too-few examples. Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter twe

5 0.12157189 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

6 0.10710339 14 hunch net-2005-02-07-The State of the Reduction

7 0.089814357 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

8 0.088752963 259 hunch net-2007-08-19-Choice of Metrics

9 0.082532011 177 hunch net-2006-05-05-An ICML reject

10 0.071928166 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

11 0.070794024 274 hunch net-2007-11-28-Computational Consequences of Classification

12 0.068753548 454 hunch net-2012-01-30-ICML Posters and Scope

13 0.065986745 138 hunch net-2005-12-09-Some NIPS papers

14 0.063369066 304 hunch net-2008-06-27-Reviewing Horror Stories

15 0.060943514 430 hunch net-2011-04-11-The Heritage Health Prize

16 0.060162116 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

17 0.059827343 406 hunch net-2010-08-22-KDD 2010

18 0.05938058 178 hunch net-2006-05-08-Big machine learning

19 0.056220438 235 hunch net-2007-03-03-All Models of Learning have Flaws

20 0.054334432 247 hunch net-2007-06-14-Interesting Papers at COLT 2007


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.132), (1, 0.072), (2, 0.043), (3, -0.025), (4, -0.024), (5, 0.006), (6, 0.028), (7, 0.032), (8, -0.03), (9, -0.024), (10, -0.047), (11, 0.065), (12, 0.005), (13, 0.002), (14, -0.005), (15, 0.009), (16, 0.018), (17, -0.001), (18, -0.059), (19, 0.011), (20, 0.037), (21, -0.018), (22, -0.022), (23, -0.006), (24, -0.031), (25, -0.007), (26, 0.013), (27, -0.059), (28, -0.024), (29, -0.101), (30, -0.028), (31, -0.015), (32, -0.013), (33, -0.022), (34, 0.093), (35, 0.061), (36, 0.033), (37, 0.022), (38, 0.088), (39, 0.032), (40, 0.095), (41, -0.069), (42, -0.007), (43, 0.006), (44, 0.074), (45, 0.009), (46, 0.016), (47, 0.017), (48, -0.008), (49, -0.052)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95623583 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC


2 0.71842498 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name Method Explanation Remedy Traditional overfitting Train a complex predictor on too-few examples. Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter twe

3 0.69591177 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

Introduction: This, again, is something of a research direction rather than a single problem. There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (Google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these. Problem: What does the ability to classify well imply about performance under these metrics? Past Work: Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC. Sample complexity bounds for AROC (and here). A paper on “Learning to Order Things”. Difficulty: Several of these may be easy. Some of them may be h

4 0.67888242 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana , Alexandru Niculescu , Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression . For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. ( Calibrated ) boosted decision trees appear to perform best, in general although support v

5 0.6685155 177 hunch net-2006-05-05-An ICML reject

Introduction: Hal , Daniel , and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm. Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers. Theoretical underpinning. Searn is a reduction which comes with a reduction guarantee: the good performance on a base classifiers implies good performance for the overall system. No other theorem of this type has been made for other structured prediction algorithms, as far as we know. Speed. Searn has no problem handling much larger datasets than other algorithms we tested against. Simplicity. Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to

6 0.63279229 26 hunch net-2005-02-21-Problem: Cross Validation

7 0.58148003 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

8 0.55586731 43 hunch net-2005-03-18-Binomial Weighting

9 0.54568756 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

10 0.54269397 87 hunch net-2005-06-29-Not EM for clustering at COLT

11 0.53372717 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

12 0.53332675 67 hunch net-2005-05-06-Don’t mix the solution into the problem

13 0.50581539 14 hunch net-2005-02-07-The State of the Reduction

14 0.50080603 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms

15 0.50045991 138 hunch net-2005-12-09-Some NIPS papers

16 0.4821296 148 hunch net-2006-01-13-Benchmarks for RL

17 0.47274965 104 hunch net-2005-08-22-Do you believe in induction?

18 0.47265106 259 hunch net-2007-08-19-Choice of Metrics

19 0.47249177 32 hunch net-2005-02-27-Antilearning: When proximity goes bad

20 0.46717167 120 hunch net-2005-10-10-Predictive Search is Coming


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.023), (16, 0.023), (27, 0.224), (38, 0.057), (53, 0.042), (55, 0.052), (64, 0.343), (94, 0.061), (95, 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96302801 155 hunch net-2006-02-07-Pittsburgh Mind Reading Competition

Introduction: Francisco Pereira points out a fun Prediction Competition . Francisco says: DARPA is sponsoring a competition to analyze data from an unusual functional Magnetic Resonance Imaging experiment. Subjects watch videos inside the scanner while fMRI data are acquired. Unbeknownst to these subjects, the videos have been seen by a panel of other subjects that labeled each instant with labels in categories such as representation (are there tools, body parts, motion, sound), location, presence of actors, emotional content, etc. The challenge is to predict all of these different labels on an instant-by-instant basis from the fMRI data. A few reasons why this is particularly interesting: This is beyond the current state of the art, but not inconceivably hard. This is a new type of experiment design current analysis methods cannot deal with. This is an opportunity to work with a heavily examined and preprocessed neuroimaging dataset. DARPA is offering prizes!

2 0.91761869 442 hunch net-2011-08-20-The Large Scale Learning Survey Tutorial

Introduction: Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially. This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD , tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters. This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth (Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no

3 0.86653149 420 hunch net-2010-12-26-NIPS 2010

Introduction: I enjoyed attending NIPS this year, with several things interesting me. For the conference itself: Peter Welinder , Steve Branson , Serge Belongie , and Pietro Perona , The Multidimensional Wisdom of Crowds . This paper is about using mechanical turk to get label information, with results superior to a majority vote approach. David McAllester , Tamir Hazan , and Joseph Keshet Direct Loss Minimization for Structured Prediction . This is about another technique for directly optimizing the loss in structured prediction, with an application to speech recognition. Mohammad Saberian and Nuno Vasconcelos Boosting Classifier Cascades . This is about an algorithm for simultaneously optimizing loss and computation in a classifier cascade construction. There were several other papers on cascades which are worth looking at if interested. Alan Fern and Prasad Tadepalli , A Computational Decision Theory for Interactive Assistants . This paper carves out some

same-blog 4 0.86589342 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC


5 0.86520988 210 hunch net-2006-09-28-Programming Languages for Machine Learning Implementations

Introduction: Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier. Speed: The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eke out every bit of computational power. Garbage collected languages (java, ocaml, perl, and python) often have several issues here. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory can be many CPU cycles long. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. I

6 0.862324 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design

7 0.80108964 291 hunch net-2008-03-07-Spock Challenge Winners

8 0.65244162 343 hunch net-2009-02-18-Decision by Vetocracy

9 0.63098294 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

10 0.61983299 351 hunch net-2009-05-02-Wielding a New Abstraction

11 0.6173169 360 hunch net-2009-06-15-In Active Learning, the question changes

12 0.60986841 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?

13 0.60916233 131 hunch net-2005-11-16-The Everything Ensemble Edge

14 0.60470617 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

15 0.6046775 194 hunch net-2006-07-11-New Models

16 0.60358173 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

17 0.60291976 26 hunch net-2005-02-21-Problem: Cross Validation

18 0.60091782 424 hunch net-2011-02-17-What does Watson mean?

19 0.59906954 378 hunch net-2009-11-15-The Other Online Learning

20 0.59609669 371 hunch net-2009-09-21-Netflix finishes (and starts)