hunch_net hunch_net-2005 hunch_net-2005-12 knowledge-graph by maker-knowledge-mining

12 hunch net-2005-02-03-Learning Theory, by assumption


meta info for this blog

Source: html

Introduction: One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases.

No assumptions
- Online learning: There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms.
- Universal Learning: Using a “bias” of 2^(-description length of Turing machine) in learning is equivalent to all other computable biases up to some constant.
- Reductions: The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems.

Independent and Identically Distributed (IID) Data
- Performance Prediction: Based upon past performance, you can predict future performance.
- Uniform Convergence: Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.
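The "no assumptions" online-learning claim above can be illustrated with a weighted-majority-style meta-predictor. This is a hedged sketch, not code from the post: the post names no specific algorithm, and the function name `weighted_majority` and penalty factor `beta` are illustrative choices.

```python
# Weighted-majority sketch: a meta-predictor over a fixed set of
# expert predictors. Wrong experts are down-weighted, so the meta
# algorithm's mistake count stays comparable to the best expert's.

def weighted_majority(expert_preds, outcomes, beta=0.5):
    """expert_preds: list of per-expert prediction lists, values in {0, 1}.
    outcomes: list of true labels in {0, 1}.
    Returns (meta_mistakes, per_expert_mistakes)."""
    n = len(expert_preds)
    weights = [1.0] * n
    meta_mistakes = 0
    expert_mistakes = [0] * n
    for t, y in enumerate(outcomes):
        # Weighted vote across experts.
        vote1 = sum(w for w, p in zip(weights, expert_preds) if p[t] == 1)
        vote0 = sum(w for w, p in zip(weights, expert_preds) if p[t] == 0)
        guess = 1 if vote1 >= vote0 else 0
        if guess != y:
            meta_mistakes += 1
        # Penalize every expert that was wrong this round.
        for i in range(n):
            if expert_preds[i][t] != y:
                expert_mistakes[i] += 1
                weights[i] *= beta
    return meta_mistakes, expert_mistakes
```

With three experts (always-0, always-1, alternating) on an all-ones outcome sequence, the always-1 expert makes no mistakes, and the meta-predictor's mistakes stay bounded rather than growing with the sequence length.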


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One way to organize learning theory is by assumption (in the assumption = axiom sense ), from no assumptions to many assumptions. [sent-1, score-0.878]

2 As you travel down this list, the statements become stronger, but the scope of applicability decreases. [sent-2, score-0.422]

3 No assumptions Online learning There exists a meta prediction algorithm which competes well with the best element of any set of prediction algorithms. [sent-3, score-1.034]

4 Universal Learning Using a “bias” of 2^(-description length of Turing machine) in learning is equivalent to all other computable biases up to some constant. [sent-4, score-0.802]

5 Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems. [sent-5, score-0.991]

6 Independent and Identically Distributed (IID) Data Performance Prediction Based upon past performance, you can predict future performance. [sent-6, score-0.162]

7 Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers. [sent-7, score-0.555]

8 Strong Constraints on the Data Source Bayes Learning When the data source is drawn from the prior, using Bayes law is optimal. This doesn’t include all forms of learning theory, because I do not know them all. [sent-9, score-0.743]

9 If there are other bits you know of, please comment. [sent-10, score-0.088]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('bayes', 0.321), ('source', 0.19), ('law', 0.189), ('data', 0.186), ('prediction', 0.18), ('performance', 0.173), ('equivalent', 0.17), ('constraints', 0.165), ('predict', 0.162), ('assumption', 0.158), ('iid', 0.158), ('assumptions', 0.156), ('meta', 0.146), ('applicability', 0.135), ('axiom', 0.128), ('computable', 0.128), ('list', 0.127), ('identically', 0.122), ('ability', 0.12), ('biases', 0.113), ('function', 0.112), ('length', 0.109), ('solution', 0.108), ('pac', 0.106), ('turing', 0.106), ('organize', 0.103), ('travel', 0.103), ('universal', 0.103), ('compete', 0.101), ('reach', 0.101), ('element', 0.099), ('agree', 0.099), ('uniform', 0.099), ('scope', 0.096), ('partial', 0.096), ('based', 0.096), ('distributed', 0.094), ('eventually', 0.094), ('convergence', 0.093), ('sets', 0.093), ('forms', 0.091), ('description', 0.089), ('theory', 0.088), ('bits', 0.088), ('statements', 0.088), ('stronger', 0.088), ('learning', 0.087), ('right', 0.085), ('well', 0.085), ('positive', 0.084)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 12 hunch net-2005-02-03-Learning Theory, by assumption


2 0.22780354 7 hunch net-2005-01-31-Watchword: Assumption

Introduction: “Assumption” is another word to be careful with in machine learning because it is used in several ways. Assumption = Bias There are several ways to see that some form of ‘bias’ (= preferring of one solution over another) is necessary. This is obvious in an adversarial setting. A good bit of work has been expended explaining this in other settings with “ no free lunch ” theorems. This is a usage specialized to learning which is particularly common when talking about priors for Bayesian Learning. Assumption = “if” of a theorem The assumptions are the ‘if’ part of the ‘if-then’ in a theorem. This is a fairly common usage. Assumption = Axiom The assumptions are the things that we assume are true, but which we cannot verify. Examples are “the IID assumption” or “my problem is a DNF on a small number of bits”. This is the usage which I prefer. One difficulty with any use of the word “assumption” is that you often encounter “if assumption then conclusion so if no

3 0.19285113 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

Introduction: One of the most confusing things about understanding learning theory is the vast array of differing assumptions. Some critical thought about which of these assumptions are reasonable for real-world problems may be useful. Before we even start thinking about assumptions, it’s important to realize that the word has multiple meanings . The meaning used here is “assumption = axiom” (i.e. something you can not verify). Assumption Reasonable? Which analysis? Example/notes Independent and Identically Distributed Data Sometimes PAC,ERM,Prediction bounds,statistics The KDD cup 2004 physics dataset is plausibly IID data. There are a number of situations which are “almost IID” in the sense that IID analysis results in correct intuitions. Unreasonable in adversarial situations (stock market, war, etc…) Independently Distributed Data More than IID, but still only sometimes online->batch conversion Losing “identical” can be helpful in situations where you

4 0.17615095 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

5 0.17610814 56 hunch net-2005-04-14-Families of Learning Theory Statements

Introduction: The diagram above shows a very broad viewpoint of learning theory.

arrow | Typical statement | Examples
Past->Past | Some prediction algorithm A does almost as well as any of a set of algorithms. | Weighted Majority
Past->Future | Assuming independent samples, past performance predicts future performance. | PAC analysis, ERM analysis
Future->Future | Future prediction performance on subproblems implies future prediction performance using algorithm A. | ECOC, Probing

A basic question is: Are there other varieties of statements of this type? Avrim noted that there are also “arrows between arrows”: generic methods for transforming between Past->Past statements and Past->Future statements. Are there others?

6 0.17322038 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

7 0.15286048 347 hunch net-2009-03-26-Machine Learning is too easy

8 0.15162225 8 hunch net-2005-02-01-NIPS: Online Bayes

9 0.14372492 127 hunch net-2005-11-02-Progress in Active Learning

10 0.13789663 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

11 0.132715 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

12 0.13222758 160 hunch net-2006-03-02-Why do people count for learning?

13 0.13147677 28 hunch net-2005-02-25-Problem: Online Learning

14 0.12976345 14 hunch net-2005-02-07-The State of the Reduction

15 0.12612534 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

16 0.12607527 19 hunch net-2005-02-14-Clever Methods of Overfitting

17 0.12580542 165 hunch net-2006-03-23-The Approximation Argument

18 0.11647186 133 hunch net-2005-11-28-A question of quantification

19 0.11465806 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

20 0.11350372 143 hunch net-2005-12-27-Automated Labeling


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.248), (1, 0.181), (2, 0.009), (3, -0.038), (4, 0.063), (5, -0.056), (6, -0.006), (7, 0.058), (8, 0.095), (9, -0.052), (10, -0.024), (11, 0.034), (12, 0.141), (13, -0.021), (14, 0.033), (15, -0.1), (16, -0.054), (17, -0.086), (18, -0.067), (19, -0.076), (20, 0.078), (21, -0.131), (22, 0.024), (23, -0.128), (24, -0.037), (25, -0.106), (26, 0.061), (27, 0.073), (28, 0.073), (29, 0.067), (30, 0.119), (31, -0.013), (32, -0.113), (33, -0.008), (34, 0.078), (35, -0.042), (36, -0.018), (37, 0.083), (38, 0.167), (39, 0.034), (40, -0.026), (41, -0.004), (42, -0.014), (43, 0.007), (44, -0.019), (45, -0.077), (46, -0.074), (47, 0.018), (48, 0.037), (49, -0.004)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95774132 12 hunch net-2005-02-03-Learning Theory, by assumption


2 0.83745301 7 hunch net-2005-01-31-Watchword: Assumption


3 0.74406105 57 hunch net-2005-04-16-Which Assumptions are Reasonable?


4 0.70450324 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product features i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
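The Naive Bayes scheme described above can be sketched directly from the excerpt's formula, Pr’(label) * Π_i Pr’(feature_i | label), with counts smoothed by a prior. This is an illustrative sketch, not the post's own code; the function names and the Laplace-style smoothing parameter `alpha` are assumptions.

```python
# Naive Bayes for binary features: estimate Pr'(feature_i | label)
# by counting with a smoothing prior, then multiply the learned
# probabilities together at prediction time.

def train_nb(X, y, alpha=1.0):
    """X: list of binary feature tuples; y: list of 0/1 labels.
    Returns (prior, cond) where cond[label][i] = Pr'(x_i = 1 | label)."""
    n, d = len(y), len(X[0])
    prior, cond = {}, {}
    for label in (0, 1):
        idx = [j for j in range(n) if y[j] == label]
        prior[label] = (len(idx) + alpha) / (n + 2 * alpha)
        cond[label] = [
            (sum(X[j][i] for j in idx) + alpha) / (len(idx) + 2 * alpha)
            for i in range(d)
        ]
    return prior, cond

def predict_nb(prior, cond, x):
    """Return the label maximizing Pr'(label) * prod_i Pr'(x_i | label)."""
    scores = {}
    for label in (0, 1):
        p = prior[label]
        for i, xi in enumerate(x):
            q = cond[label][i]
            p *= q if xi == 1 else (1 - q)  # multiply learned probabilities
        scores[label] = p
    return max(scores, key=scores.get)
```

The multiplication step is exactly where the post's warning applies: with a poor prior for the counts, small estimation errors in each factor compound across features.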

5 0.66943312 160 hunch net-2006-03-02-Why do people count for learning?

Introduction: This post is about a confusion of mine with respect to many commonly used machine learning algorithms. A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes, where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express: P(A,B,C) = P(A | B,C)P(B)P(C) What I don’t understand is the mechanism commonly used to estimate P(A | B, C) . If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to: P’(A | B,C) = [N(A,B,C)/N] / [N(B)/N * N(C)/N] = N(A,B,C) N / [N(B) N(C)] … in other words, people just estimate P’(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size space A x B x C , but it (nat
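The counting estimate the excerpt questions can be sketched in a few lines. This is illustrative code, not from the post: it computes both the quoted estimate N(A,B,C)·N / [N(B)·N(C)] and the plain conditional frequency N(A,B,C)/N(B,C) for comparison; the function name is an assumption.

```python
# Counting-based estimates of a Bayes-net conditional P'(A | B, C),
# from a list of observed (a, b, c) samples.
from collections import Counter

def counting_estimates(samples, a, b, c):
    """Returns (quoted, plain) where
    quoted = N(A,B,C) * N / (N(B) * N(C))   -- the formula in the post
    plain  = N(A,B,C) / N(B,C)              -- direct relative frequency"""
    n = len(samples)
    n_abc = Counter(samples)[(a, b, c)]
    n_b = sum(1 for (_, bb, _) in samples if bb == b)
    n_c = sum(1 for (_, _, cc) in samples if cc == c)
    n_bc = sum(1 for (_, bb, cc) in samples if bb == b and cc == c)
    quoted = n_abc * n / (n_b * n_c)
    plain = n_abc / n_bc
    return quoted, plain
```

The two estimates agree when B and C are independent in the sample (since then N(B,C)/N ≈ N(B)N(C)/N²), which is one way to read the confusion the post raises.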

6 0.66762149 133 hunch net-2005-11-28-A question of quantification

7 0.63856661 56 hunch net-2005-04-14-Families of Learning Theory Statements

8 0.62753677 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

9 0.62045872 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.61708987 347 hunch net-2009-03-26-Machine Learning is too easy

11 0.61480314 34 hunch net-2005-03-02-Prior, “Prior” and Bias

12 0.61027682 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

13 0.59699845 43 hunch net-2005-03-18-Binomial Weighting

14 0.5931673 104 hunch net-2005-08-22-Do you believe in induction?

15 0.57734013 28 hunch net-2005-02-25-Problem: Online Learning

16 0.57376403 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

17 0.55116272 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

18 0.53560686 148 hunch net-2006-01-13-Benchmarks for RL

19 0.52333885 143 hunch net-2005-12-27-Automated Labeling

20 0.51795071 183 hunch net-2006-06-14-Explorations of Exploration


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.032), (24, 0.146), (27, 0.264), (38, 0.102), (53, 0.133), (55, 0.088), (94, 0.036), (95, 0.103)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.92781758 12 hunch net-2005-02-03-Learning Theory, by assumption


2 0.89204878 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

Introduction: Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU . This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous. This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers. For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through th

3 0.88677347 95 hunch net-2005-07-14-What Learning Theory might do

Introduction: I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides. Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that: Explains the basic intuitions. Makes new testable predictions about how to learn. Succeeds in so learning. This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems. Where next for learning theory? I like the a

4 0.87740225 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

Introduction: “Deep learning” is used to describe learning architectures which have significant depth (as a circuit). One claim is that shallow architectures (one or two layers) can not concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan ‘s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so. Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees depending on who you ask) can’t learn XOR very easily.
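The depth argument in the excerpt can be made concrete with a tiny sketch: a single thresholded weighted sum cannot compute XOR, but composing two layers of such units can. This is illustrative code, not from the post; the unit weights are one of many workable choices.

```python
# XOR via a depth-2 threshold circuit: OR and NAND in the first
# layer, AND in the second. No single threshold unit computes XOR.

def unit(weights, bias, inputs):
    """A thresholded weighted sum: 1 if w.x + b > 0, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def xor(a, b):
    h1 = unit([1, 1], -0.5, [a, b])      # OR(a, b)
    h2 = unit([-1, -1], 1.5, [a, b])     # NAND(a, b)
    return unit([1, 1], -1.5, [h1, h2])  # AND(h1, h2)
```

Checking all four inputs confirms the circuit computes XOR, whereas no choice of (w1, w2, b) for a single unit can place (0,1) and (1,0) on one side of a line and (0,0) and (1,1) on the other.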

5 0.87510461 131 hunch net-2005-11-16-The Everything Ensemble Edge

Introduction: Rich Caruana , Alexandru Niculescu , Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is: Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression . For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method. A series of conclusions can be drawn from the observations. ( Calibrated ) boosted decision trees appear to perform best, in general although support v
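In its simplest form, the "use all methods" recipe above amounts to averaging calibrated probability estimates from several models. The following is a minimal hedged sketch of that combination step only (it omits the calibration and evaluation machinery the study used; `ensemble_predict` is an illustrative name):

```python
# Average the probability estimates of several (assumed calibrated)
# models to form a single ensemble prediction.

def ensemble_predict(models, x):
    """models: callables mapping x -> estimated P(y = 1 | x).
    Returns the uniform average of the models' probabilities."""
    probs = [m(x) for m in models]
    return sum(probs) / len(probs)
```

The empirical finding summarized in the excerpt is that such combinations of many diverse, calibrated methods tend to outperform any single method across losses.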

6 0.87312788 19 hunch net-2005-02-14-Clever Methods of Overfitting

7 0.87245399 194 hunch net-2006-07-11-New Models

8 0.87224966 370 hunch net-2009-09-18-Necessary and Sufficient Research

9 0.86297584 96 hunch net-2005-07-21-Six Months

10 0.85980791 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

11 0.8580603 134 hunch net-2005-12-01-The Webscience Future

12 0.85426325 466 hunch net-2012-06-05-ICML acceptance statistics

13 0.85310209 360 hunch net-2009-06-15-In Active Learning, the question changes

14 0.85307568 235 hunch net-2007-03-03-All Models of Learning have Flaws

15 0.85048795 225 hunch net-2007-01-02-Retrospective

16 0.84974319 36 hunch net-2005-03-05-Funding Research

17 0.84790254 14 hunch net-2005-02-07-The State of the Reduction

18 0.84682208 201 hunch net-2006-08-07-The Call of the Deep

19 0.84654891 406 hunch net-2010-08-22-KDD 2010

20 0.84575438 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”