hunch_net hunch_net-2006 hunch_net-2006-177 knowledge-graph by maker-knowledge-mining

177 hunch net-2006-05-05-An ICML reject


meta info for this blog

Source: html

Introduction: Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm. Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers. Theoretical underpinning: Searn is a reduction which comes with a reduction guarantee: good performance of the base classifier implies good performance for the overall system. No other theorem of this type has been made for other structured prediction algorithms, as far as we know. Speed: Searn has no problem handling much larger datasets than other algorithms we tested against. Simplicity: Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to
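
To make the simplicity claim concrete, here is a minimal sketch of a Searn-style training loop. It is an illustration only, not the authors' implementation: the interfaces (`roll_in`, `choose`, `base_learner`, `rollout_cost`) and the interpolation constant `beta` are assumptions of the sketch.

```python
# Minimal sketch of a Searn-style loop (illustrative; all interfaces are assumed).
import random

class StochasticMix:
    """Policy that follows the newly learned classifier with probability p_new,
    and otherwise follows the previous policy."""
    def __init__(self, new, old, p_new):
        self.new, self.old, self.p_new = new, old, p_new

    def choose(self, state):
        picked = self.new if random.random() < self.p_new else self.old
        return picked.choose(state)

    def roll_in(self, x):
        # Simplification for the sketch: roll in with the previous policy.
        return self.old.roll_in(x)

def searn_sketch(examples, initial_policy, base_learner, rollout_cost,
                 actions, iterations=5, beta=0.3):
    """Iteratively move from a good initial (reference) policy toward a policy
    built from learned classifiers."""
    policy = initial_policy
    for _ in range(iterations):
        cost_sensitive_examples = []
        for x, y in examples:
            # Roll in: visit the states the current policy reaches on x.
            for state in policy.roll_in(x):
                # For each action, estimate the global loss of taking it here and
                # letting the current policy complete the structured output.
                costs = [rollout_cost(x, y, state, a, policy) for a in actions]
                cost_sensitive_examples.append((state, costs))
        # Reduce to cost-sensitive classification and train a base classifier.
        h = base_learner(cost_sensitive_examples)
        # Stochastically interpolate the new classifier with the old policy.
        policy = StochasticMix(new=h, old=policy, p_new=beta)
    return policy
```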


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. [sent-4, score-0.21]

2 Searn is a reduction which comes with a reduction guarantee: good performance of the base classifier implies good performance for the overall system. [sent-7, score-0.584]

3 It can use (and cope with) arbitrary loss functions over the data. [sent-15, score-0.191]

4 A very typical (although often unstated) tradeoff is expending extra computation to gain better predictive performance in practice. [sent-18, score-0.294]

5 One endorsement of this comes from a reviewer who said In the end, I do think there is something there, but I think its introduction should have been like that for CRF. [sent-26, score-0.175]

6 The SPC stated: The results, though, which essentially show that good local classifiers imply good global performance, are not that significant, and hold for other approaches that use local classifiers as building blocks. [sent-30, score-1.119]

7 After all, perfect local classifiers provide perfect local accuracy, and therefore provide perfect global accuracy, and again provide perfect global loss of any kind. [sent-31, score-2.511]

8 Both sentences are simply false in the setting we consider. [sent-32, score-0.199]

9 In particular, no other algorithms appear to have a good local performance to global performance guarantee for general global loss functions. [sent-33, score-1.537]

10 Furthermore, it is not the case that perfect local performance implies perfect global performance except (perhaps) in a noise free world. [sent-34, score-1.566]

11 Most of us believe that the problems we address typically contain fundamental ambiguities and noise (that was certainly our mathematical model). [sent-35, score-0.196]

12 It’s easy to set up a (noisy) distribution over inputs+loss such that best-possible-up-to-the-noise-limit local predictors are globally suboptimal. [sent-36, score-0.25] (A concrete numeric example appears after this list.)

13 The SPC wanted us to contrast with Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP02. [sent-37, score-0.196]

14 I believe any reasonable reading of these and the Searn paper will find the ACL04 paper supersedes the EMNLP02 paper in relevance. [sent-39, score-0.195]

15 IBT is a modification of the earlier algorithm which integrates global information (in the form of problem-specific constraints) into the local training process, yielding performance gains. [sent-50, score-1.01]

16 Searn is made to cope with global loss functions rather than global constraints. [sent-52, score-0.871]

17 For whatever reason, it is psychologically difficult to build on rejected work. [sent-57, score-0.192]

18 If you think something is true but aren’t sure, it is appropriate to say “I think …” rather than simply asserting it as a fact. [sent-65, score-0.387]

19 If there are no comments or simply comments about Searn, that’s fine. [sent-68, score-0.203]

20 This post violates a standard: avoiding talking about specific papers the poster has been working on. [sent-73, score-0.177]
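
To make sentence 12 above concrete, here is a small numeric illustration (my own, not from the paper): with two binary output positions, the per-position Bayes-optimal predictions need not form the jointly optimal output under a 0/1 global loss.

```python
# Numeric illustration (not from the paper): the best possible per-position
# predictors are globally suboptimal under a 0/1 loss on the whole output.
from itertools import product

# A noisy conditional distribution over two binary output positions.
P = {(0, 0): 0.4, (1, 1): 0.3, (1, 0): 0.3}

# Bayes-optimal local predictors predict each position's marginal mode.
p_y1_is_1 = sum(p for (y1, _), p in P.items() if y1 == 1)  # 0.6 -> predict 1
p_y2_is_1 = sum(p for (_, y2), p in P.items() if y2 == 1)  # 0.3 -> predict 0
local_prediction = (int(p_y1_is_1 > 0.5), int(p_y2_is_1 > 0.5))  # (1, 0)

def expected_global_loss(prediction):
    """Expected 0/1 loss over the entire output."""
    return sum(p for y, p in P.items() if y != prediction)

best = min(product([0, 1], repeat=2), key=expected_global_loss)
print(local_prediction, expected_global_loss(local_prediction))  # (1, 0) 0.7
print(best, expected_global_loss(best))                          # (0, 0) 0.6
```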


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('searn', 0.386), ('global', 0.311), ('local', 0.25), ('perfect', 0.226), ('performance', 0.215), ('spc', 0.202), ('classifiers', 0.154), ('ibt', 0.123), ('punyakanok', 0.123), ('false', 0.11), ('collins', 0.109), ('asserting', 0.109), ('roth', 0.109), ('loss', 0.109), ('rejected', 0.104), ('modification', 0.101), ('structured', 0.092), ('simply', 0.089), ('build', 0.088), ('cope', 0.082), ('tradeoff', 0.079), ('perceptron', 0.079), ('tested', 0.077), ('aren', 0.075), ('provide', 0.074), ('algorithm', 0.073), ('us', 0.071), ('accuracy', 0.07), ('hal', 0.07), ('true', 0.069), ('noise', 0.065), ('paper', 0.065), ('stated', 0.064), ('guarantee', 0.064), ('wanted', 0.063), ('michael', 0.062), ('algorithms', 0.062), ('contrast', 0.062), ('post', 0.061), ('specific', 0.06), ('think', 0.06), ('problems', 0.06), ('experiments', 0.059), ('weak', 0.059), ('case', 0.058), ('made', 0.058), ('comments', 0.057), ('papers', 0.056), ('looking', 0.056), ('reviewer', 0.055)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 177 hunch net-2006-05-05-An ICML reject


2 0.15670927 351 hunch net-2009-05-02-Wielding a New Abstraction

Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo
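
As a tiny concrete instance of reusing a classification abstraction (my own illustration, not from that post), the classic one-against-all reduction turns one multiclass problem into k binary problems; `binary_learner` is an assumed black box.

```python
# One-against-all reduction sketch (illustrative; `binary_learner` maps labeled
# examples with labels in {-1, +1} to a real-valued scoring function).
def one_against_all_train(examples, k, binary_learner):
    """examples: list of (x, y) with y in {0, ..., k-1}."""
    return [binary_learner([(x, +1 if y == c else -1) for x, y in examples])
            for c in range(k)]

def one_against_all_predict(scorers, x):
    """Predict the class whose binary scorer is most confident about x."""
    return max(range(len(scorers)), key=lambda c: scorers[c](x))
```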

3 0.15580584 19 hunch net-2005-02-14-Clever Methods of Overfitting

Introduction: “Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets. We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid. Name Method Explanation Remedy Traditional overfitting Train a complex predictor on too-few examples. Hold out pristine examples for testing. Use a simpler predictor. Get more training examples. Integrate over many predictors. Reject papers which do this. Parameter twe

4 0.15242817 343 hunch net-2009-02-18-Decision by Vetocracy

Introduction: Few would mistake the process of academic paper review for a fair process, but sometimes the unfairness seems particularly striking. This is most easily seen by comparison: Paper Banditron Offset Tree Notes Problem Scope Multiclass problems where only the loss of one choice can be probed. Strictly greater: Cost sensitive multiclass problems where only the loss of one choice can be probed. Often generalizations don’t matter. That’s not the case here, since every plausible application I’ve thought of involves loss functions substantially different from 0/1. What’s new Analysis and Experiments Algorithm, Analysis, and Experiments As far as I know, the essence of the more general problem was first stated and analyzed with the EXP4 algorithm (page 16) (1998). It’s also the time horizon 1 simplification of the Reinforcement Learning setting for the random trajectory method (page 15) (2002). The Banditron algorithm itself is functionally identi

5 0.14564919 238 hunch net-2007-04-13-What to do with an unreasonable conditional accept

Introduction: Last year about this time, we received a conditional accept for the searn paper, which asked us to reference a paper that was not reasonable to cite because there was strictly more relevant work by the same authors that we already cited. We wrote a response explaining this, and didn’t cite it in the final draft, giving the SPC an excuse to reject the paper, leading to unhappiness for all. Later, Sanjoy Dasgupta suggested that an alternative was to talk to the PC chair instead, as soon as you see that a conditional accept is unreasonable. William Cohen and I spoke about this by email, the relevant bit of which is: If an SPC asks for a revision that is inappropriate, the correct action is to contact the chairs as soon as the decision is made, clearly explaining what the problem is, so we can decide whether or not to over-rule the SPC. As you say, this is extra work for us chairs, but that’s part of the job, and we’re willing to do that sort of work to improve the ov

6 0.13605313 87 hunch net-2005-06-29-Not EM for clustering at COLT

7 0.13427861 43 hunch net-2005-03-18-Binomial Weighting

8 0.13168116 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

9 0.12655589 454 hunch net-2012-01-30-ICML Posters and Scope

10 0.12288994 420 hunch net-2010-12-26-NIPS 2010

11 0.11857257 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

12 0.11856646 14 hunch net-2005-02-07-The State of the Reduction

13 0.11798044 22 hunch net-2005-02-18-What it means to do research.

14 0.11758158 235 hunch net-2007-03-03-All Models of Learning have Flaws

15 0.11547669 9 hunch net-2005-02-01-Watchword: Loss

16 0.11519089 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

17 0.11250316 394 hunch net-2010-04-24-COLT Treasurer is now Phil Long

18 0.11172078 304 hunch net-2008-06-27-Reviewing Horror Stories

19 0.11096048 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

20 0.11038096 437 hunch net-2011-07-10-ICML 2011 and the future


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.301), (1, 0.053), (2, 0.098), (3, -0.012), (4, -0.047), (5, 0.028), (6, -0.008), (7, -0.032), (8, -0.005), (9, 0.048), (10, -0.004), (11, 0.003), (12, -0.015), (13, -0.03), (14, 0.046), (15, 0.013), (16, 0.045), (17, 0.014), (18, -0.015), (19, -0.037), (20, 0.028), (21, -0.068), (22, -0.045), (23, -0.115), (24, -0.069), (25, -0.007), (26, -0.028), (27, -0.039), (28, 0.043), (29, -0.121), (30, -0.076), (31, 0.015), (32, -0.015), (33, -0.032), (34, 0.067), (35, 0.102), (36, -0.024), (37, 0.026), (38, 0.036), (39, 0.049), (40, 0.051), (41, -0.005), (42, -0.034), (43, -0.066), (44, 0.154), (45, -0.026), (46, -0.034), (47, 0.056), (48, -0.022), (49, 0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97610837 177 hunch net-2006-05-05-An ICML reject


2 0.70126903 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC

Introduction: Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion. The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order. Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance. AROC is the area under the ROC curve. It is a single number summary of performance. The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.” The ROC vs. accuracy discussion is o
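
A small sketch of the distinction (my own, assuming scikit-learn is available): AROC/AUC depends only on how the scores rank examples, while accuracy also depends on the decision threshold.

```python
# Sketch: the same scores can give perfect AUC but mediocre accuracy,
# because accuracy additionally depends on the chosen threshold.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]          # perfectly ordered, but all below 0.5
y_pred = [int(s >= 0.5) for s in scores]

print(roc_auc_score(y_true, scores))    # 1.0  (perfect ranking)
print(accuracy_score(y_true, y_pred))   # 0.5  (poor threshold choice)
```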

3 0.65224534 19 hunch net-2005-02-14-Clever Methods of Overfitting


4 0.63035321 52 hunch net-2005-04-04-Grounds for Rejection

Introduction: It’s reviewing season right now, so I thought I would list (at a high level) the sorts of problems which I see in papers. Hopefully, this will help us all write better papers. The following flaws are fatal to any paper: Incorrect theorem or lemma statements: A typo might be “ok”, if it can be understood. Any theorem or lemma which indicates an incorrect understanding of reality must be rejected. Not doing so would severely harm the integrity of the conference. A paper rejected for this reason must be fixed. Lack of Understanding: If a paper is understood by none of the (typically 3) reviewers then it must be rejected for the same reason. This is more controversial than it sounds because there are some people who maximize paper complexity in the hope of impressing the reviewer. The tactic sometimes succeeds with some reviewers (but not with me). As a reviewer, I sometimes get lost for stupid reasons. This is why an anonymized communication channel with the author can

5 0.62841636 343 hunch net-2009-02-18-Decision by Vetocracy


6 0.62140781 351 hunch net-2009-05-02-Wielding a New Abstraction

7 0.61567134 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

8 0.61275876 87 hunch net-2005-06-29-Not EM for clustering at COLT

9 0.6105516 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

10 0.60737795 454 hunch net-2012-01-30-ICML Posters and Scope

11 0.60604155 42 hunch net-2005-03-17-Going all the Way, Sometimes

12 0.58552951 131 hunch net-2005-11-16-The Everything Ensemble Edge

13 0.58397746 98 hunch net-2005-07-27-Not goal metrics

14 0.57958972 43 hunch net-2005-03-18-Binomial Weighting

15 0.57868207 420 hunch net-2010-12-26-NIPS 2010

16 0.56721264 104 hunch net-2005-08-22-Do you believe in induction?

17 0.56691808 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

18 0.56007892 103 hunch net-2005-08-18-SVM Adaptability

19 0.55805969 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms

20 0.55254692 333 hunch net-2008-12-27-Adversarial Academia


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.013), (3, 0.019), (9, 0.019), (10, 0.041), (27, 0.25), (38, 0.063), (46, 0.223), (53, 0.082), (55, 0.056), (92, 0.011), (94, 0.094), (95, 0.038)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.95492315 397 hunch net-2010-05-02-What’s the difference between gambling and rewarding good prediction?

Introduction: After a major financial crisis, there is much discussion about how finance has become a casino: gambling with others’ money, keeping the winnings, and walking away when the money is lost. When thinking about financial reform, all the many losers in the above scenario are apt to take the view that this activity should be completely, or nearly completely, curtailed. But, a more thoughtful view is that sometimes there is a real sense in which there are right and wrong decisions, and we as a society would really prefer that the people most likely to make right decisions are making them. A crucial question then is: “What is the difference between gambling and rewarding good prediction?” We discussed this before the financial crisis. The cheat-sheet sketch is that the online learning against an adversary problem, algorithm, and theorems, provide a good mathematical model for thinking about this question. What I would like to do here is map this onto various types of financial t

2 0.91551429 26 hunch net-2005-02-21-Problem: Cross Validation

Introduction: The essential problem here is the large gap between experimental observation and theoretical understanding. Method: K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then tested on the fold. Problem: Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate. Past Work: (I’ll add more as I remember/learn.) Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. Not all of this is online, but here is one paper. Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K. Avrim Blum, Adam Kalai, and myself analyzed cross validation and found tha
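
A minimal sketch of the K-fold procedure described above (illustrative; `train` and `error_rate` stand in for any learning algorithm and its test error):

```python
# K-fold cross validation sketch (placeholders for train/error_rate are assumed).
def k_fold_cv(examples, K, train, error_rate):
    """Return the average held-out error over K folds of size roughly m/K."""
    folds = [examples[i::K] for i in range(K)]            # partition the m examples
    errors = []
    for i in range(K):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classifier = train(training)                      # train on the other folds
        errors.append(error_rate(classifier, held_out))   # test on this fold
    return sum(errors) / K
```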

same-blog 3 0.91280627 177 hunch net-2006-05-05-An ICML reject


4 0.90213162 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

Introduction: Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made. Online learning (as meant here) is a subfield of learning theory which analyzes the online learning model. In the online learning model, there is a set of hypotheses or “experts”. On any instance x, each expert makes a prediction y. A master algorithm A uses these predictions to form its own prediction y_A and then learns the correct prediction y*. This process repeats. The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove: L(A) les
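
A standard way to get a guarantee of that shape is an exponential-weights master algorithm; this is a sketch of mine, not from the post, and the usual bound for this family is of the form L(A) <= min_e L(e) + O(sqrt(T log N)) over T rounds with N experts.

```python
# Exponential-weights (Hedge-style) master algorithm sketch.
import math

def hedge(expert_losses, eta=0.5):
    """expert_losses[t][i] is the loss of expert i in round t, assumed in [0, 1].
    Returns (master_expected_loss, best_single_expert_loss)."""
    n = len(expert_losses[0])
    weights = [1.0] * n
    master_loss = 0.0
    for losses in expert_losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        # The master follows expert i with probability probs[i] this round.
        master_loss += sum(p * l for p, l in zip(probs, losses))
        # Downweight each expert exponentially in its observed loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    best = min(sum(round_losses[i] for round_losses in expert_losses)
               for i in range(n))
    return master_loss, best
```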

5 0.79245877 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression. For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. (Calibrated) boosted decision trees appear to perform best, in general although support v
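
A sketch of the calibrate-then-average step (my own, assuming scikit-learn; the specific model list is illustrative, not the set used in that study):

```python
# Calibrate several models (sigmoid/Platt or isotonic) and average their
# predicted probabilities for a binary problem.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC

def averaged_calibrated_probs(X_train, y_train, X_test, method="isotonic"):
    models = [GradientBoostingClassifier(), RandomForestClassifier(), SVC()]
    probs = []
    for m in models:
        calibrated = CalibratedClassifierCV(m, method=method, cv=3)
        calibrated.fit(X_train, y_train)
        probs.append(calibrated.predict_proba(X_test)[:, 1])  # P(positive class)
    return np.mean(probs, axis=0)  # simple average over the calibrated ensemble
```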

6 0.78366011 235 hunch net-2007-03-03-All Models of Learning have Flaws

7 0.78296816 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

8 0.78232658 95 hunch net-2005-07-14-What Learning Theory might do

9 0.78156781 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.78117847 347 hunch net-2009-03-26-Machine Learning is too easy

11 0.77971631 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

12 0.77924812 332 hunch net-2008-12-23-Use of Learning Theory

13 0.77903759 360 hunch net-2009-06-15-In Active Learning, the question changes

14 0.77881944 258 hunch net-2007-08-12-Exponentiated Gradient

15 0.77792025 351 hunch net-2009-05-02-Wielding a New Abstraction

16 0.77692795 78 hunch net-2005-06-06-Exact Online Learning for Classification

17 0.77639467 41 hunch net-2005-03-15-The State of Tight Bounds

18 0.77502644 14 hunch net-2005-02-07-The State of the Reduction

19 0.77498525 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.77487785 12 hunch net-2005-02-03-Learning Theory, by assumption