hunch net-2005-02-03-Learning Theory, by assumption
Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases.

No Assumptions
- Online Learning: There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms.
- Universal Learning: Using a “bias” of 2^(-description length of the Turing machine) in learning is equivalent to all other computable biases, up to a constant.
- Reductions: The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems.
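The online learning claim above is realized by algorithms such as weighted majority: a meta predictor whose mistake count is within O(log N) of the best of N experts on any outcome sequence, with no statistical assumption at all. Below is a minimal sketch of deterministic weighted majority for binary prediction; the particular experts, the penalty parameter beta, and the outcome sequence are invented for illustration.

```python
# Sketch of the deterministic Weighted Majority algorithm.
# Each "expert" maps a round index to a {0,1} prediction. The meta
# algorithm's mistake count is O(log N + mistakes of the best expert),
# regardless of how the outcome sequence is generated.

def weighted_majority(experts, outcomes, beta=0.5):
    """Return (meta_mistakes, per_expert_mistakes)."""
    weights = [1.0] * len(experts)
    expert_mistakes = [0] * len(experts)
    meta_mistakes = 0
    for t, y in enumerate(outcomes):
        preds = [e(t) for e in experts]
        # Predict by weighted vote over the experts.
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_1 >= vote_0 else 0
        if guess != y:
            meta_mistakes += 1
        # Multiplicatively penalize every expert that erred this round.
        for i, p in enumerate(preds):
            if p != y:
                weights[i] *= beta
                expert_mistakes[i] += 1
    return meta_mistakes, expert_mistakes

# Hypothetical experts: always-1, always-0, and "alternate starting at 1".
experts = [lambda t: 1, lambda t: 0, lambda t: (t + 1) % 2]
outcomes = [1, 0, 1, 0, 1, 0, 1, 1]  # arbitrary, possibly adversarial
meta, per_expert = weighted_majority(experts, outcomes)
```

The key point for the taxonomy: the mistake bound holds per-sequence, so nothing probabilistic need be assumed about the data source.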
Independent and Identically Distributed (IID) Data
- Performance Prediction: Based upon past performance, you can predict future performance.
- Uniform Convergence: Performance prediction works even after choosing classifiers, based on the data, from large sets of classifiers.
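Under the IID assumption, both statements above are quantified by standard concentration bounds: Hoeffding's inequality bounds the deviation between empirical and true error for one fixed classifier, and a union bound over a finite set of classifiers yields the uniform convergence form, at only a logarithmic cost in the set size. A sketch of these two standard bounds follows; the sample size, confidence level, and classifier count are made-up numbers.

```python
from math import log, sqrt

def hoeffding_deviation(n, delta):
    """Two-sided Hoeffding bound: for one fixed classifier,
    |empirical error - true error| <= this value with probability
    at least 1 - delta over n IID samples."""
    return sqrt(log(2.0 / delta) / (2.0 * n))

def uniform_deviation(n, delta, k):
    """Union bound over k classifiers: split delta k ways so the bound
    holds simultaneously for every classifier in the set. This is why
    performance prediction survives picking a classifier using the data."""
    return sqrt(log(2.0 * k / delta) / (2.0 * n))

n, delta = 10000, 0.05
single = hoeffding_deviation(n, delta)              # one fixed classifier
simultaneous = uniform_deviation(n, delta, k=1000)  # best of 1000 classifiers
# The uniform bound is looser, but only logarithmically in k.
```

Note the contrast with the online learning setting: here the guarantee is about future (unseen) performance, which is exactly what the IID assumption buys.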
Strong Constraints on the Data Source
- Bayes Learning: When the data source is drawn from the prior, using Bayes’ law is optimal.

This doesn’t include all forms of learning theory, because I do not know them all. If there are other bits you know of, please comment.
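To see the strongest assumption in action: when nature really does draw a hypothesis from a known prior and then generates the data from it, the posterior computed via Bayes' law gives the optimal prediction of the next observation. A toy sketch with a two-coin prior; the prior, coin biases, and flip sequence are all invented for illustration.

```python
# Toy Bayes-optimal prediction: nature draws a coin from a known prior,
# we observe flips, and we predict the next flip via the posterior.
# Hypotheses: coin A has P(heads) = 0.9, coin B has P(heads) = 0.3.

def posterior(prior, heads_probs, flips):
    """Return the posterior over hypotheses after observing flips (1 = heads)."""
    post = list(prior)
    for f in flips:
        # Multiply each hypothesis weight by its likelihood of this flip.
        post = [p * (q if f == 1 else 1.0 - q)
                for p, q in zip(post, heads_probs)]
    z = sum(post)  # normalize
    return [p / z for p in post]

prior = [0.5, 0.5]
heads_probs = [0.9, 0.3]
flips = [1, 1, 0, 1]  # observed sequence
post = posterior(prior, heads_probs, flips)
# Bayes-optimal probability that the next flip is heads: posterior-weighted
# average of the per-coin heads probabilities.
p_next_heads = sum(p * q for p, q in zip(post, heads_probs))
```

The optimality here is conditional on the strong assumption: the prior used by the learner must be the same distribution nature actually draws from.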
Similar posts to hunch net-2005-02-03-Learning Theory, by assumption:
- hunch net-2005-01-31-Watchword: Assumption
- hunch net-2005-04-16-Which Assumptions are Reasonable?
- hunch net-2007-03-03-All Models of Learning have Flaws
- hunch net-2005-04-14-Families of Learning Theory Statements
- hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
- hunch net-2009-03-26-Machine Learning is too easy
- hunch net-2005-02-01-NIPS: Online Bayes
- hunch net-2005-11-02-Progress in Active Learning
- hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
- hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
- hunch net-2006-03-02-Why do people count for learning?
- hunch net-2005-02-25-Problem: Online Learning
- hunch net-2005-02-07-The State of the Reduction
- hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
- hunch net-2005-02-14-Clever Methods of Overfitting
- hunch net-2006-03-23-The Approximation Argument
- hunch net-2005-11-28-A question of quantification
- hunch net-2005-12-07-Is the Google way the way for machine learning?
- hunch net-2005-12-27-Automated Labeling