hunch_net hunch_net-2006 hunch_net-2006-163 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: In the online learning with experts setting, you observe a set of predictions, make a decision, and then observe the truth. This process repeats indefinitely. In this setting, it is possible to prove theorems of the sort: master algorithm error count <= k * (best predictor error count) + c * log(number of predictors). Is this a statement about learning or about preservation of learning? We did some experiments to analyze the new Binning algorithm, which works in this setting. For several UCI datasets, we reprocessed them so that features could be used as predictors and then applied several master algorithms. The first graph confirms that Binning is indeed a better algorithm according to the tightness of the upper bound. Here, “Best” is the performance of the best expert. “V. Bound” is the bound for Vovk’s algorithm (the previous best). “Bound” is the bound for the Binning algorithm. “Binning” is the performance of the Binning algorithm. The Binning algorithm clearly has a tighter bound, and the performance bound is clearly a sharp constraint on the algorithm performance.
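To make the shape of that guarantee concrete, here is a minimal sketch of the classic deterministic Weighted Majority master algorithm for binary experts. This is not the Binning algorithm analyzed in the post; the function name, interface, and beta parameter are illustrative assumptions. It is shown only because its standard mistake bound has exactly the stated form, master mistakes <= k * (best expert mistakes) + c * log(number of experts) (for beta = 1/2, k is roughly 2.4 and c roughly 3.5).

```python
def weighted_majority(expert_preds, truths, beta=0.5):
    """Classic deterministic Weighted Majority over binary experts (a sketch).

    expert_preds: list of T rounds, each a list of N predictions in {0, 1}.
    truths:       list of T outcomes in {0, 1}, revealed after each decision.

    Standard guarantee (not specific to this post): the master's mistakes are
    at most (m * ln(1/beta) + ln N) / ln(2 / (1 + beta)), where m is the best
    expert's mistake count -- i.e. a bound of the form k*m + c*log(N).
    """
    n_experts = len(expert_preds[0])
    weights = [1.0] * n_experts
    master_mistakes = 0
    expert_mistakes = [0] * n_experts

    for preds, y in zip(expert_preds, truths):
        # Weighted vote: predict 1 when the experts predicting 1 carry more weight.
        weight_one = sum(w for w, p in zip(weights, preds) if p == 1)
        weight_zero = sum(weights) - weight_one
        decision = 1 if weight_one >= weight_zero else 0
        if decision != y:
            master_mistakes += 1
        # Multiplicatively penalize every expert that was wrong this round.
        for i, p in enumerate(preds):
            if p != y:
                weights[i] *= beta
                expert_mistakes[i] += 1

    return master_mistakes, expert_mistakes
```

Randomizing the master's decision in proportion to the weights (as Vovk-style algorithms effectively do) roughly halves the leading constant, which is the factor-of-2 gap between the deterministic and randomized algorithms mentioned in the experiments below.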
sentIndex sentText sentNum sentScore
1 In the online learning with experts setting, you observe a set of predictions, make a decision, and then observe the truth. [sent-1, score-0.368]
2 In this setting, it is possible to prove theorems of the sort: master algorithm error count <= k * (best predictor error count) + c * log(number of predictors). Is this a statement about learning or about preservation of learning? [sent-3, score-0.799]
3 We did some experiments to analyze the new Binning algorithm which works in this setting. [sent-4, score-0.136]
4 For several UCI datasets, we reprocessed them so that features could be used as predictors and then applied several master algorithms. [sent-5, score-0.183]
5 The first graph confirms that Binning is indeed a better algorithm according to the tightness of the upper bound. [sent-6, score-0.438]
6 Here, “Best” is the performance of the best expert. [sent-7, score-0.294]
7 “V. Bound” is the bound for Vovk’s algorithm (the previous best). [sent-9, score-0.436]
8 “Binning” is the performance of the Binning algorithm. [sent-11, score-0.191]
9 The Binning algorithm clearly has a tighter bound, and the performance bound is clearly a sharp constraint on the algorithm performance. [sent-12, score-0.89]
10 “Bin” is the performance of Binning (identical to the previous graph). [sent-14, score-0.242]
11 BW is Binomial Weighting, which is (roughly) the deterministic version of Binning. [sent-15, score-0.159]
12 Both BW and WM (Weighted Majority) are deterministic algorithms, which implies their performance bounds are perhaps a factor of 2 worse than those of Binning or Vovk’s algorithm. [sent-17, score-0.404]
13 In contrast, the actual performance (rather than the performance bound) of the deterministic algorithms is sometimes even better than the best expert (negative regret?). [sent-18, score-0.644]
14 A consistent negative correlation between “online bound tightness” and “learning performance” is observed. [sent-21, score-0.368]
15 This is not convincing because many other learning algorithms do as well or better with the given features/experts. [sent-27, score-0.222]
16 Another possibility is: “you can start out running Binning, and when it pulls ahead of its bound, run any learning algorithm. [sent-28, score-0.305]
17 If the learning algorithm does badly, you can switch back to Binning and preserve the guarantee.” (This switching idea is sketched below.) [sent-29, score-0.28]
18 My best current understanding is that “online learning with experts” is really “online preservation of learning”: the goal of the algorithm is to preserve whatever predictive ability the individual predictors have. [sent-31, score-0.732]
19 For example, charity events sometimes work according to the following form: All participants exchange dollars for bogobucks. [sent-34, score-0.322]
20 An online preservation algorithm has the property that if you acquire enough bogobucks in comparison to the number of participants, you can guarantee winning the prize. [sent-37, score-0.575]
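Sentences 16 and 17 describe a switching scheme: run Binning until its mistake count pulls comfortably ahead of (i.e., below) its bound, gamble on an arbitrary learning algorithm while that cushion lasts, and fall back to Binning if the gamble goes badly. The sketch below illustrates the idea only; the learner interface, the names, and the slack parameter are hypothetical, and the bookkeeping needed to literally preserve the guarantee is glossed over.

```python
def guarded_run(safe_learner, risky_learner, bound_fn, stream, slack=1):
    """Sketch of the 'switch back and preserve the guarantee' idea.

    safe_learner:  learner with .predict(x)/.update(x, y) and a known mistake
                   bound bound_fn(t) after t rounds (think: Binning).
    risky_learner: any other learner with the same (hypothetical) interface.
    stream:        iterable of (x, y) pairs; y is revealed after predicting.
    slack:         cushion below the bound required before trusting the
                   risky learner (an arbitrary illustrative choice).
    """
    our_mistakes = 0
    for t, (x, y) in enumerate(stream, start=1):
        cushion = bound_fn(t) - our_mistakes
        # Gamble on the risky learner only while we are safely ahead of the
        # guarantee; otherwise fall back to the safe learner.
        learner = risky_learner if cushion > slack else safe_learner
        if learner.predict(x) != y:
            our_mistakes += 1
        # Keep both learners up to date regardless of which one predicted.
        safe_learner.update(x, y)
        risky_learner.update(x, y)
    return our_mistakes
```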
wordName wordTfidf (topN-words)
[('binning', 0.666), ('bound', 0.249), ('preservation', 0.242), ('performance', 0.191), ('deterministic', 0.159), ('bw', 0.136), ('wm', 0.136), ('algorithm', 0.136), ('tightness', 0.106), ('participants', 0.106), ('experts', 0.104), ('best', 0.103), ('predictors', 0.101), ('vovk', 0.101), ('uci', 0.097), ('preserve', 0.097), ('online', 0.096), ('convincing', 0.086), ('count', 0.086), ('observe', 0.084), ('winner', 0.082), ('master', 0.082), ('graph', 0.078), ('negative', 0.069), ('clearly', 0.065), ('statement', 0.064), ('confirms', 0.061), ('examining', 0.061), ('scenarios', 0.061), ('according', 0.057), ('gamble', 0.056), ('ahead', 0.056), ('bin', 0.056), ('charity', 0.056), ('safety', 0.056), ('bounds', 0.054), ('wrong', 0.054), ('binomial', 0.053), ('dollars', 0.053), ('environments', 0.053), ('reply', 0.053), ('whatever', 0.053), ('winning', 0.053), ('previous', 0.051), ('exchange', 0.05), ('correlation', 0.05), ('repeats', 0.05), ('acquire', 0.048), ('sharp', 0.048), ('switch', 0.047)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999994 163 hunch net-2006-03-12-Online learning or online preservation of learning?
2 0.21466592 218 hunch net-2006-11-20-Context and the calculation misperception
Introduction: This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it. Suppose we have a set of events, each described by a vector of features. 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0 Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7. There are two issues here. The first is that this is really a training
3 0.15569404 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability
Introduction: Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made. Online learning (as meant here) is a subfield of learning theory which analyzes the online learning model. In the online learning model, there are a set of hypotheses or “experts”. On any instance x, each expert makes a prediction y. A master algorithm A uses these predictions to form its own prediction y_A and then learns the correct prediction y*. This process repeats. The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove: L(A) les
4 0.1430719 41 hunch net-2005-03-15-The State of Tight Bounds
Introduction: What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are called confidence intervals. Why? Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms. What We Know Now There are several families of bounds, based on how information is used. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation (also here and here), train and test bounds, and cross-validation (but see the big open problem). These tec
5 0.13013148 244 hunch net-2007-05-09-The Missing Bound
Introduction: Sham Kakade points out that we are missing a bound. Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method we know that: If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^0.5) (at least in crude form). A proof is on page 9 here. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small. What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning,
6 0.1201584 19 hunch net-2005-02-14-Clever Methods of Overfitting
7 0.1157653 170 hunch net-2006-04-06-Bounds greater than 1
8 0.11285871 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms
9 0.11014093 85 hunch net-2005-06-28-A COLT paper
10 0.10931051 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science
11 0.10887502 8 hunch net-2005-02-01-NIPS: Online Bayes
12 0.10519333 131 hunch net-2005-11-16-The Everything Ensemble Edge
13 0.10126179 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
14 0.10123608 252 hunch net-2007-07-01-Watchword: Online Learning
15 0.09555497 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning
16 0.095296539 388 hunch net-2010-01-24-Specializations of the Master Problem
17 0.093142942 28 hunch net-2005-02-25-Problem: Online Learning
18 0.092384353 258 hunch net-2007-08-12-Exponentiated Gradient
19 0.092111811 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
20 0.09165448 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions
topicId topicWeight
[(0, 0.172), (1, 0.126), (2, 0.043), (3, -0.044), (4, 0.047), (5, -0.058), (6, 0.035), (7, -0.022), (8, -0.029), (9, 0.037), (10, 0.033), (11, 0.144), (12, 0.144), (13, -0.11), (14, 0.065), (15, -0.012), (16, 0.021), (17, -0.04), (18, -0.053), (19, 0.052), (20, 0.048), (21, -0.031), (22, -0.104), (23, 0.002), (24, 0.031), (25, 0.036), (26, 0.032), (27, 0.033), (28, 0.097), (29, -0.026), (30, -0.057), (31, 0.013), (32, 0.017), (33, -0.014), (34, -0.033), (35, -0.019), (36, 0.03), (37, -0.007), (38, 0.026), (39, -0.002), (40, -0.1), (41, 0.042), (42, -0.071), (43, 0.008), (44, 0.022), (45, -0.016), (46, 0.022), (47, 0.019), (48, -0.07), (49, 0.033)]
simIndex simValue blogId blogTitle
same-blog 1 0.95806068 163 hunch net-2006-03-12-Online learning or online preservation of learning?
2 0.75031948 85 hunch net-2005-06-28-A COLT paper
Introduction: I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.
3 0.69350302 170 hunch net-2006-04-06-Bounds greater than 1
Introduction: Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are always greater than 1. Background One branch of learning theory focuses on theorems which: assume samples are drawn IID from an unknown distribution D; fix a set of classifiers; find a high probability bound on the maximum true error rate (with respect to D) as a function of the empirical error rate on the training set. Many of these bounds become extremely complex and hairy. Current Everyone working on this subject wants “tighter bounds”, however there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space) while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.) One of the da
4 0.68381929 41 hunch net-2005-03-15-The State of Tight Bounds
5 0.63329345 78 hunch net-2005-06-06-Exact Online Learning for Classification
Introduction: Jacob Abernethy and I have found a computationally tractable method for computing an optimal (or near optimal depending on setting) master algorithm combining expert predictions, addressing this open problem. A draft is here. The effect of this improvement seems to be about a factor of 2 decrease in the regret (= error rate minus best possible error rate) for the low error rate situation. (At large error rates, there may be no significant difference.) There are some unfinished details still to consider: When we remove all of the approximation slack from online learning, is the result a satisfying learning algorithm, in practice? I consider online learning to be one of the more compelling methods of analyzing and deriving algorithms, but that expectation must be either met or not by this algorithm. Some extra details: The algorithm is optimal given a small amount of side information (k in the draft). What is the best way to remove this side information? The removal
6 0.63157207 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science
7 0.62953502 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability
8 0.62306017 244 hunch net-2007-05-09-The Missing Bound
9 0.60739511 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
10 0.60097677 28 hunch net-2005-02-25-Problem: Online Learning
11 0.58750707 133 hunch net-2005-11-28-A question of quantification
12 0.57435292 43 hunch net-2005-03-18-Binomial Weighting
13 0.54748875 19 hunch net-2005-02-14-Clever Methods of Overfitting
14 0.54108214 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms
15 0.53868014 56 hunch net-2005-04-14-Families of Learning Theory Statements
16 0.52636975 131 hunch net-2005-11-16-The Everything Ensemble Edge
17 0.52631187 35 hunch net-2005-03-04-The Big O and Constants in Learning
18 0.52558815 26 hunch net-2005-02-21-Problem: Cross Validation
19 0.51815027 8 hunch net-2005-02-01-NIPS: Online Bayes
20 0.51716334 220 hunch net-2006-11-27-Continuizing Solutions
topicId topicWeight
[(3, 0.029), (25, 0.287), (27, 0.17), (30, 0.011), (38, 0.089), (46, 0.019), (53, 0.094), (55, 0.073), (77, 0.016), (94, 0.103), (95, 0.01)]
simIndex simValue blogId blogTitle
1 0.8804571 178 hunch net-2006-05-08-Big machine learning
Introduction: According to the New York Times, Yahoo is releasing Project Panama shortly. Project Panama is about better predicting which advertisements are relevant to a search, implying a higher click-through rate, implying larger income for Yahoo. There are two things that seem interesting here: A significant portion of that improved accuracy is almost certainly machine learning at work. The quantitative effect is huge—the estimate in the article is $600*10^6. Google already has such improvements and Microsoft Search is surely working on them, which suggests this is (perhaps) a $10^9 per year machine learning problem. The exact methodology under use is unlikely to be publicly discussed in the near future because of the competitive environment. Hopefully we’ll have some public “war stories” at some point in the future when this information becomes less sensitive. For now, it’s reassuring to simply note that machine learning is having a big impact.
same-blog 2 0.84133226 163 hunch net-2006-03-12-Online learning or online preservation of learning?
3 0.80340719 148 hunch net-2006-01-13-Benchmarks for RL
Introduction: A couple years ago, Drew Bagnell and I started the RLBench project to set up a suite of reinforcement learning benchmark problems. We haven’t been able to touch it (due to lack of time) for a year so the project is on hold. Luckily, there are several other projects such as CLSquare and RL-Glue with a similar goal, and we strongly endorse their continued development. I would like to explain why, especially in the context of criticism of other learning benchmarks. For example, sometimes the UCI Machine Learning Repository is criticized. There are two criticisms I know of: Learning algorithms have overfit to the problems in the repository. It is easy to imagine a mechanism for this happening unintentionally. Strong evidence of this would be provided by learning algorithms which perform great on the UCI machine learning repository but very badly (relative to other learning algorithms) on non-UCI learning problems. I have seen little evidence of this but it remains a po
4 0.76517522 134 hunch net-2005-12-01-The Webscience Future
Introduction: The internet has significantly affected the way we do research but its capabilities have not yet been fully realized. First, let’s acknowledge some known effects. Self-publishing By default, all researchers in machine learning (and more generally computer science and physics) place their papers online for anyone to download. The exact mechanism differs—physicists tend to use a central repository (Arxiv) while computer scientists tend to place the papers on their webpage. Arxiv has been slowly growing in subject breadth so it is now sometimes used by computer scientists. Collaboration Email has enabled working remotely with coauthors. This has allowed collaborations which would not otherwise have been possible and generally speeds research. Now, let’s look at attempts to go further. Blogs (like this one) allow public discussion about topics which are not easily categorized as “a new idea in machine learning” (like this topic). Organization of some subfield
5 0.62467545 286 hunch net-2008-01-25-Turing’s Club for Machine Learning
Introduction: Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions: How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled. The above question can also be asked for test cases. In some applications, test-time performance is of great importance. How does the algorithm scale with the number of
6 0.62387097 131 hunch net-2005-11-16-The Everything Ensemble Edge
7 0.62280458 19 hunch net-2005-02-14-Clever Methods of Overfitting
8 0.61779839 95 hunch net-2005-07-14-What Learning Theory might do
9 0.61406344 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
10 0.61363602 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
11 0.61296642 177 hunch net-2006-05-05-An ICML reject
12 0.61280107 237 hunch net-2007-04-02-Contextual Scaling
13 0.61207324 78 hunch net-2005-06-06-Exact Online Learning for Classification
14 0.61204445 297 hunch net-2008-04-22-Taking the next step
15 0.61151648 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
16 0.60968369 233 hunch net-2007-02-16-The Forgetting
17 0.60958308 98 hunch net-2005-07-27-Not goal metrics
18 0.60929751 306 hunch net-2008-07-02-Proprietary Data in Academic Research?
19 0.60836202 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?
20 0.60831863 351 hunch net-2009-05-02-Wielding a New Abstraction