hunch_net hunch_net-2005 hunch_net-2005-43 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y. When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log_2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated. Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique that allows a variant of the Halving algorithm to cope with errors by creating, for every classifier, a set of virtual classifiers which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual classifiers…
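To make the mechanics concrete, here is a minimal sketch of the Halving algorithm described above. It assumes each classifier is a Python callable from an input x to a prediction in {0, 1}; all names and the example data are illustrative, not taken from the post.

```python
# Minimal sketch of the Halving algorithm, assuming each classifier is a
# callable mapping an input x to a prediction in {0, 1}.

def halving_predict(classifiers, x):
    """Majority vote over the classifiers still consistent with the past."""
    votes = sum(c(x) for c in classifiers)
    return 1 if 2 * votes >= len(classifiers) else 0

def run_halving(classifiers, stream):
    """Online loop: predict, observe the label, eliminate inconsistent
    classifiers. On any mistake at least half of the survivors voted for
    the wrong label and are removed, so if some classifier is perfect the
    total number of mistakes is at most log2(|classifiers|)."""
    mistakes = 0
    for x, y in stream:
        if halving_predict(classifiers, x) != y:
            mistakes += 1
        classifiers = [c for c in classifiers if c(x) == y]
    return mistakes

# Example: threshold classifiers on the integers, one of which is perfect.
classifiers = [lambda x, t=t: int(x >= t) for t in range(10)]
stream = [(x, int(x >= 3)) for x in [7, 1, 4, 2, 9, 3, 0, 5]]
print(run_halving(classifiers, stream))  # never more than log2(10) mistakes
```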
sentIndex sentText sentNum sentScore
1 Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. [sent-1, score-0.567]
2 In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y. [sent-2, score-0.657]
3 When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. [sent-3, score-1.164]
4 It makes at most log_2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated. [sent-5, score-0.681]
5 Obviously, we can’t generally hope that there exists a classifier which never errs. [sent-6, score-0.165]
6 The Binomial Weighting algorithm is an elegant technique that allows a variant of the Halving algorithm to cope with errors by creating, for every classifier, a set of virtual classifiers which occasionally disagree with the original classifier. [sent-7, score-2.023]
7 By introducing a “prior” over the number of mistakes, it can be made parameter-free. [sent-9, score-0.652]
8 Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use. [sent-10, score-1.196]
9 The minimal value of the mistake bound f() is 2 times the number of errors of the best classifier, regardless of the number of classifiers. [sent-12, score-0.803]
10 This is frustrating because a parameter-free learning algorithm taking an arbitrary “prior” and achieving good performance on an arbitrary (not even IID) set of examples would be compelling for implementation and use, if only we had a good technique for removing the factor of 2. [sent-13, score-0.927]
11 See the weighted majority algorithm for an example of a similar algorithm which removes the factor of 2 using randomization, at the expense of introducing a parameter (a minimal sketch follows this sentence list). [sent-15, score-1.081]
12 There are known techniques for eliminating this parameter, but they appear to be less tight (and therefore less practically useful) than introducing a “prior” over the number of errors. [sent-16, score-0.666]
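Sentence 11 mentions that randomization can remove the factor of 2 at the cost of a parameter; here is a minimal sketch of a randomized weighted majority algorithm in that spirit. The parameter beta and all names are illustrative assumptions, not details from the post.

```python
import random

def randomized_weighted_majority(classifiers, stream, beta=0.5):
    """Sketch of randomized weighted majority: predict with a classifier
    sampled in proportion to its weight, then multiplicatively penalize
    (by beta) every classifier that erred. The randomized prediction is
    what removes the factor of 2 in expectation, at the cost of the
    parameter beta in (0, 1)."""
    weights = [1.0] * len(classifiers)
    mistakes = 0
    for x, y in stream:
        preds = [c(x) for c in classifiers]
        # Sample a prediction with probability proportional to its weight.
        chosen = random.choices(preds, weights=weights, k=1)[0]
        if chosen != y:
            mistakes += 1
        # Penalize every classifier that predicted incorrectly.
        weights = [w * (beta if p != y else 1.0)
                   for w, p in zip(weights, preds)]
    return mistakes
```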
wordName wordTfidf (topN-words)
[('classifiers', 0.354), ('introducing', 0.332), ('errors', 0.309), ('halving', 0.249), ('binomial', 0.235), ('weighting', 0.201), ('mistakes', 0.201), ('algorithm', 0.201), ('parameter', 0.171), ('classifier', 0.165), ('prior', 0.162), ('virtual', 0.149), ('number', 0.149), ('minimal', 0.127), ('set', 0.118), ('majority', 0.114), ('arbitrary', 0.103), ('factor', 0.098), ('every', 0.098), ('technique', 0.096), ('see', 0.095), ('maximal', 0.078), ('frustrating', 0.075), ('mistake', 0.075), ('occasionally', 0.075), ('removing', 0.075), ('vote', 0.075), ('randomization', 0.072), ('regardless', 0.069), ('satisfies', 0.069), ('cope', 0.067), ('flexible', 0.067), ('disagree', 0.065), ('elegant', 0.065), ('min', 0.065), ('makes', 0.064), ('eliminating', 0.063), ('remove', 0.063), ('therefore', 0.062), ('half', 0.062), ('perfect', 0.062), ('variant', 0.06), ('possibly', 0.06), ('tight', 0.06), ('sufficiently', 0.06), ('consistent', 0.059), ('implementation', 0.058), ('repeatedly', 0.058), ('called', 0.055), ('unlabeled', 0.055)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 43 hunch net-2005-03-18-Binomial Weighting
2 0.21092767 28 hunch net-2005-02-25-Problem: Online Learning
Introduction: Despite my best intentions, this is not a fully specified problem, but rather a research direction. Competitive online learning is one of the more compelling pieces of learning theory because typical statements of the form “this algorithm will perform almost as well as a large set of other algorithms” rely only on fully-observable quantities, and are therefore applicable in many situations. Examples include Winnow, Weighted Majority, and Binomial Weighting. Algorithms with this property haven’t taken over the world yet. Here are some possible reasons: Lack of caring. Many people working on learning theory don’t care about particular applications much. This means constants in the algorithm are not optimized, usable code is often not produced, and empirical studies aren’t done. Inefficiency. Viewed from the perspective of other learning algorithms, online learning is terribly inefficient. It requires that every hypothesis (called an expert in the online learning set
3 0.14081267 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
Introduction: This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example of this is error correcting output codes, where it is possible to prove that for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)^0.5. The use of “regret” is more desi
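As a quick worked comparison of the two guarantee styles in the excerpt above, using the constants quoted there (the binary error and regret values are made up for illustration):

```python
binary_error = 0.05    # assumed error rate of the binary classifiers
binary_regret = 0.05   # assumed binary regret (error minus minimum error)

# Error limiting: multiclass error <= 4 * binary error.
multiclass_error_bound = 4 * binary_error         # 0.2
# Regret minimizing: multiclass regret <= 4 * (binary regret)^0.5,
# illustrating the square-root degradation.
multiclass_regret_bound = 4 * binary_regret**0.5  # ~0.894

print(multiclass_error_bound, multiclass_regret_bound)
```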
4 0.13427861 177 hunch net-2006-05-05-An ICML reject
Introduction: Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm. Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers. Theoretical underpinning. Searn is a reduction which comes with a reduction guarantee: good performance on the base classifiers implies good performance for the overall system. No other theorem of this type has been made for other structured prediction algorithms, as far as we know. Speed. Searn has no problem handling much larger datasets than other algorithms we tested against. Simplicity. Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to
5 0.1298968 274 hunch net-2007-11-28-Computational Consequences of Classification
Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints
6 0.12621137 160 hunch net-2006-03-02-Why do people count for learning?
7 0.12437008 14 hunch net-2005-02-07-The State of the Reduction
8 0.12414001 127 hunch net-2005-11-02-Progress in Active Learning
9 0.12374199 34 hunch net-2005-03-02-Prior, “Prior” and Bias
10 0.11741754 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
11 0.1157499 237 hunch net-2007-04-02-Contextual Scaling
12 0.11442 247 hunch net-2007-06-14-Interesting Papers at COLT 2007
13 0.11396057 26 hunch net-2005-02-21-Problem: Cross Validation
14 0.097817086 267 hunch net-2007-10-17-Online as the new adjective
15 0.096487686 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability
16 0.095629387 19 hunch net-2005-02-14-Clever Methods of Overfitting
17 0.095500603 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
18 0.093889445 12 hunch net-2005-02-03-Learning Theory, by assumption
19 0.090774223 163 hunch net-2006-03-12-Online learning or online preservation of learning?
20 0.089848101 104 hunch net-2005-08-22-Do you believe in induction?
topicId topicWeight
[(0, 0.177), (1, 0.152), (2, 0.053), (3, -0.049), (4, 0.032), (5, -0.037), (6, 0.05), (7, 0.008), (8, -0.005), (9, 0.015), (10, -0.014), (11, 0.042), (12, 0.117), (13, -0.103), (14, 0.052), (15, -0.042), (16, -0.047), (17, -0.001), (18, -0.018), (19, 0.027), (20, -0.035), (21, -0.07), (22, 0.121), (23, -0.017), (24, 0.047), (25, -0.037), (26, 0.083), (27, 0.068), (28, 0.048), (29, -0.094), (30, -0.016), (31, -0.007), (32, 0.03), (33, 0.029), (34, 0.096), (35, 0.063), (36, -0.013), (37, 0.029), (38, 0.067), (39, 0.027), (40, 0.007), (41, -0.065), (42, -0.05), (43, -0.104), (44, 0.05), (45, -0.01), (46, -0.032), (47, 0.026), (48, 0.03), (49, 0.033)]
simIndex simValue blogId blogTitle
same-blog 1 0.97795475 43 hunch net-2005-03-18-Binomial Weighting
2 0.71536958 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature_i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Prod_i Pr’(feature_i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
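A minimal sketch of the prediction rule described in this excerpt, assuming precomputed probability estimates; the log-space computation and all names are illustrative additions, not from the post:

```python
import math

def naive_bayes_predict(log_prior, log_cond, features):
    """Return the label maximizing
        Pr'(label) * prod_i Pr'(feature_i | label),
    computed in log space so that multiplying many small estimated
    probabilities does not underflow.
      log_prior: dict label -> log Pr'(label)
      log_cond:  dict (i, value, label) -> log Pr'(feature_i = value | label)
    """
    def score(label):
        return log_prior[label] + sum(
            log_cond[(i, v, label)] for i, v in enumerate(features))
    return max(log_prior, key=score)
```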
3 0.6994648 160 hunch net-2006-03-02-Why do people count for learning?
Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C). What I don’t understand is the mechanism commonly used to estimate P(A | B,C). If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [N(B)/N * N(C)/N] = N(A,B,C) N / [N(B) N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C, but it (nat
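A minimal sketch of the relative-frequency estimate discussed in this excerpt, assuming the data is a list of (a, b, c) tuples; all names are illustrative:

```python
from collections import Counter

def estimate_conditional(data):
    """Estimate P'(A=a | B=b, C=c) = N(a,b,c) * N / (N(b) * N(c)),
    i.e. [N(a,b,c)/N] divided by [(N(b)/N) * (N(c)/N)], from observed
    relative frequencies. Reasonable when the number of samples is large
    compared to |A x B x C|, which is the excerpt's point of concern."""
    n = len(data)
    n_abc = Counter(data)
    n_b = Counter(b for _, b, _ in data)
    n_c = Counter(c for _, _, c in data)
    return {(a, b, c): count * n / (n_b[b] * n_c[c])
            for (a, b, c), count in n_abc.items()}
```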
4 0.60584092 133 hunch net-2005-11-28-A question of quantification
Introduction: This is about methods for phrasing and thinking about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same. For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.” For all training sets. This is the standard quantification for boosting analysis such as adaboost or multiclass boosting. Standard theorems have the form “for all training sets the error rate inequalities … hold”. For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”. It is not quite true that each of these is equivalent. F
5 0.58808619 34 hunch net-2005-03-02-Prior, “Prior” and Bias
Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt
6 0.58430594 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
7 0.5829125 26 hunch net-2005-02-21-Problem: Cross Validation
8 0.57170475 32 hunch net-2005-02-27-Antilearning: When proximity goes bad
9 0.56985414 28 hunch net-2005-02-25-Problem: Online Learning
10 0.56678808 19 hunch net-2005-02-14-Clever Methods of Overfitting
11 0.55444694 18 hunch net-2005-02-12-ROC vs. Accuracy vs. AROC
12 0.55158085 78 hunch net-2005-06-06-Exact Online Learning for Classification
13 0.55149019 163 hunch net-2006-03-12-Online learning or online preservation of learning?
14 0.54950392 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
15 0.54925138 131 hunch net-2005-11-16-The Everything Ensemble Edge
16 0.54185754 7 hunch net-2005-01-31-Watchword: Assumption
17 0.54128957 104 hunch net-2005-08-22-Do you believe in induction?
18 0.54027951 218 hunch net-2006-11-20-Context and the calculation misperception
19 0.53271621 12 hunch net-2005-02-03-Learning Theory, by assumption
20 0.53247172 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
topicId topicWeight
[(0, 0.025), (26, 0.245), (27, 0.299), (53, 0.055), (55, 0.062), (77, 0.044), (94, 0.12), (95, 0.041)]
simIndex simValue blogId blogTitle
1 0.93467653 97 hunch net-2005-07-23-Interesting papers at ACL
Introduction: A recent discussion indicated that one goal of this blog might be to allow people to post comments about recent papers that they liked. I think this could potentially be very useful, especially for those with diverse interests but only finite time to read through conference proceedings. ACL 2005 recently completed, and here are four papers from that conference that I thought were either good or perhaps of interest to a machine learning audience. David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ’s Y” -> “Y de X” in Chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, whic
same-blog 2 0.92533797 43 hunch net-2005-03-18-Binomial Weighting
3 0.9117468 25 hunch net-2005-02-20-At One Month
Introduction: This is near the one month point, so it seems appropriate to consider meta-issues for the moment. The number of posts is a bit over 20. The number of people speaking up in discussions is about 10. The number of people viewing the site is somewhat more than 100. I am (naturally) dissatisfied with many things. Many of the potential uses haven’t been realized. This is partly a matter of opportunity (no conferences in the last month), partly a matter of will (no open problems because it’s hard to give them up), and partly a matter of tradition. In academia, there is a strong tradition of trying to get everything perfectly right before presentation. This is somewhat contradictory to the nature of making many posts, and it’s definitely contradictory to the idea of doing “public research”. If that sort of idea is to pay off, it must be significantly more successful than previous methods. In an effort to continue experimenting, I’m going to use the next week as “open problems we
4 0.87463355 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound
Introduction: Textbooks invariably seem to carry the proof that uses Markov’s inequality, moment-generating functions, and Taylor approximations. Here’s an easier way. For p, q in (0,1), let K(q,p) be the KL divergence between a coin of bias q and one of bias p: K(q,p) = q ln(q/p) + (1-q) ln[(1-q)/(1-p)]. Theorem: Suppose you do n independent tosses of a coin of bias p. The probability of seeing qn heads or more, for q > p, is at most exp(-n K(q,p)). So is the probability of seeing qn heads or less, for q < p. Remark: By Pinsker’s inequality, K(q,p) >= 2(q-p)^2. Proof: Let’s do the q > p case; the other is identical. Let Q^n be the distribution over sequences of n tosses induced by a coin of bias q, and likewise P^n for a coin of bias p. Let S be the set of all sequences of n tosses which contain qn heads or more. We’d like to show that S is unlikely under P^n. Pick any x in S, with say k >= qn heads. Then: P^n(x)/Q^n(x) = (p/q)^k ((1-p)/(1-q))^(n-k) <= (p/q)^(qn) ((1-p)/(1-q))^((1-q)n) = exp(-n K(q,p)). Since P^n(x) <= exp(-n K(q,p)) Q^n(x) for every x in S, we have P^n(S) <= exp(-n K(q,p)) Q^n(S) <= exp(-n K(q,p)), and we’re done.
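As a quick numeric sanity check of the bound above (illustrative, not from the post), one can compare the exact binomial tail with exp(-n K(q,p)):

```python
import math

def kl(q, p):
    """KL divergence K(q,p) between coins of bias q and p."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def binomial_tail(n, p, k):
    """Exact Pr[#heads >= k] for n tosses of a coin of bias p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

n, p, q = 100, 0.3, 0.5
exact = binomial_tail(n, p, math.ceil(q * n))
bound = math.exp(-n * kl(q, p))
print(f"exact tail = {exact:.3e}, Chernoff bound = {bound:.3e}")
assert exact <= bound  # the theorem guarantees this
```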
5 0.87289971 171 hunch net-2006-04-09-Progress in Machine Translation
Introduction: I just visited ISI where Daniel Marcu and others are working on machine translation. Apparently, machine translation is rapidly improving. A particularly dramatic year was 2002->2003 when systems switched from word-based translation to phrase-based translation. From a (now famous) slide by Charles Wayne at DARPA (which funds much of the work on machine translation) here is some anecdotal evidence. 2002: insistent Wednesday may recurred her trips to Libya tomorrow for flying. Cairo 6-4 ( AFP ) – An official announced today in the Egyptian lines company for flying Tuesday is a company “insistent for flying” may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment. And said the official “the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air, a situation her recieving replying are so a trip will pull to Libya a morning Wednesday.” 2003: E
6 0.79389799 258 hunch net-2007-08-12-Exponentiated Gradient
7 0.79076469 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
8 0.78620368 351 hunch net-2009-05-02-Wielding a New Abstraction
9 0.78446263 360 hunch net-2009-06-15-In Active Learning, the question changes
10 0.78393483 252 hunch net-2007-07-01-Watchword: Online Learning
11 0.78317463 235 hunch net-2007-03-03-All Models of Learning have Flaws
12 0.78252536 347 hunch net-2009-03-26-Machine Learning is too easy
13 0.78138113 220 hunch net-2006-11-27-Continuizing Solutions
14 0.78127688 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity
15 0.78049988 388 hunch net-2010-01-24-Specializations of the Master Problem
16 0.78043556 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?
17 0.78017491 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
18 0.77959108 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
19 0.77923453 305 hunch net-2008-06-30-ICML has a comment system
20 0.77898985 286 hunch net-2008-01-25-Turing’s Club for Machine Learning