hunch_net hunch_net-2007 hunch_net-2007-235 knowledge-graph by maker-knowledge-mining

235 hunch net-2007-03-03-All Models of Learning have Flaws


meta info for this blog

Source: html

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 I’ve created a table (below) outlining the major flaws in some common models of machine learning. [sent-3, score-0.274]

2 It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. [sent-7, score-0.336]

3 It’s common to forget the flaws of the model that you are most familiar with when evaluating other models, while the flaws of new models get exaggerated. [sent-17, score-0.626]

4 (Table columns: Name, Methodology, What’s right, What’s wrong.) Bayesian Learning: You specify a prior probability distribution over data-makers, P(datamaker), then use Bayes law to find a posterior P(datamaker|x). [sent-23, score-0.452]

5 True Bayesians integrate over the posterior to make predictions, while many simply use the world with the largest posterior directly. [sent-24, score-0.399]
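For concreteness, the distinction drawn in the two sentences above can be written out in the notation they already use (a sketch added to this summary, not a quote from the original table):

```latex
% Posterior over data-makers via Bayes law
P(\mathrm{datamaker} \mid x) = \frac{P(x \mid \mathrm{datamaker})\, P(\mathrm{datamaker})}{P(x)}

% Fully Bayesian prediction: integrate over the posterior
P(y \mid x) = \int P(y \mid x, \mathrm{datamaker})\, P(\mathrm{datamaker} \mid x)\, d(\mathrm{datamaker})

% Common shortcut: predict with the single most probable world (MAP)
\widehat{\mathrm{datamaker}} = \arg\max_{\mathrm{datamaker}} P(\mathrm{datamaker} \mid x)
```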

6 Partly due to the difficulties above, and partly because “first specify a prior” is built into the framework, this approach is not very automatable. [sent-32, score-0.462]

7 In real-world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence. [sent-40, score-0.276]

8 Convex Loss Optimization: Specify a loss function, related to the world-imposed loss function, which is convex on some parametric predictive system. [sent-41, score-0.514]

9 The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous. [sent-45, score-0.362]

10 Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult. [sent-47, score-0.323]

11 Relatively computationally tractable due to (a) the modularity of gradient descent and (b) directly optimizing the quantity you want to predict. [sent-49, score-0.265]
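A minimal sketch of what the convex loss sentences above describe: replacing the nonconvex 0-1 loss the world imposes with a convex surrogate (logistic loss here) on a single-layer linear predictor, optimized by plain gradient descent. The toy data, learning rate, and function names are assumptions made for illustration, not anything from the original post.

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * w.x)); labels y in {-1, +1}."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))   # derivative of the surrogate w.r.t. each margin
    return (X.T @ coeffs) / len(y)

def train_linear(X, y, lr=0.1, steps=500):
    """Plain gradient descent on the convex surrogate."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * logistic_loss_grad(w, X, y)
    return w

# Toy usage: the world cares about 0-1 error, but we optimize the convex surrogate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
w = train_linear(X, y)
zero_one_error = np.mean(np.sign(X @ w) != y)
```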

12 People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. [sent-55, score-0.549]
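The sentence above is the kernel-methods view: prior information enters through a similarity function. Below is a minimal sketch of one such similarity (a Gaussian/RBF kernel) and a similarity-weighted predictor built from it; the bandwidth value and function names are assumptions for the example.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) similarity: nearby objects are considered similar."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * bandwidth ** 2))

def kernel_predict(x, X_train, y_train, bandwidth=1.0):
    """Similarity-weighted average of training labels (Nadaraya-Watson style)."""
    weights = np.array([rbf_kernel(x, xi, bandwidth) for xi in X_train])
    return weights @ y_train / (weights.sum() + 1e-12)
```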

13 Learning Reductions You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner. [sent-68, score-0.335]

14 The reductions approach can yield highly automated learning algorithms. [sent-69, score-0.356]
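One concrete instance of the reductions style described in the two sentences above is one-against-all, which turns k-class classification into k binary problems. The binary learner interface below (fit / decision_value) is an assumption made for this sketch, not an interface from the post.

```python
class OneAgainstAll:
    """Reduce k-class classification to k binary problems (one-against-all)."""

    def __init__(self, binary_learner_factory, num_classes):
        self.num_classes = num_classes
        self.learners = [binary_learner_factory() for _ in range(num_classes)]

    def fit(self, X, y):
        # Train one "is it class c?" binary predictor per class.
        for c, learner in enumerate(self.learners):
            learner.fit(X, [1 if label == c else -1 for label in y])
        return self

    def predict(self, x):
        # Predict the class whose binary predictor is most confident.
        return max(range(self.num_classes),
                   key=lambda c: self.learners[c].decision_value(x))
```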

15 You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner. [sent-73, score-0.28]

16 You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses. [sent-78, score-0.286]
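A textbook-style statement of the sample-counting question in the sentence above, for a finite hypothesis class H (added here for concreteness, not a quote from the post): with probability at least 1 - delta, empirical risk minimization over H returns a hypothesis within epsilon of the best in H once

```latex
m \;\ge\; \frac{2}{\epsilon^{2}}\left(\ln|H| + \ln\frac{2}{\delta}\right)
```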

17 Any reasonable problem is learnable with a number of samples related to the description length of the program. [sent-89, score-0.283]
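The standard Occam's-razor bound makes the description-length claim above precise (again a textbook statement rather than text from the post): if a program c with a prefix-free description of length |c| bits is consistent with m i.i.d. samples, then with probability at least 1 - delta its true error satisfies

```latex
\mathrm{err}(c) \;\le\; \frac{|c|\,\ln 2 + \ln\frac{1}{\delta}}{m}
```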

18 RL, MDP learning: Learning is about finding and acting according to a near-optimal policy in an unknown Markov Decision Process. [sent-91, score-0.376]

19 Has anyone counted the number of states in real world problems? [sent-93, score-0.267]

20 RL, POMDP learning: Learning is about finding and acting according to a near-optimal policy in a Partially Observed Markov Decision Process. In a sense, we’ve made no assumptions, so algorithms have wide applicability. [sent-97, score-0.368]
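A minimal sketch of the standard workhorse for the MDP framing above: tabular Q-learning, which pursues a near-optimal policy in an unknown MDP by trial and error. The env interface (reset, step, actions) and the hyperparameter values are assumptions made for the example; note it is tabular, so it only makes sense when the state count is small, which is exactly the concern the state-counting question raises.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning in an unknown MDP; env.step returns (state, reward, done)."""
    Q = defaultdict(float)                      # Q[(state, action)] value estimates
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the finite action set
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            # one-step temporal-difference update toward reward + discounted future value
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```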


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('pomdp', 0.176), ('framework', 0.169), ('models', 0.149), ('prior', 0.142), ('convex', 0.133), ('samples', 0.13), ('flaws', 0.125), ('automated', 0.124), ('specify', 0.124), ('datamaker', 0.118), ('pac', 0.114), ('iid', 0.113), ('posterior', 0.111), ('loss', 0.104), ('computationally', 0.104), ('unknown', 0.103), ('world', 0.098), ('acting', 0.097), ('frameworks', 0.097), ('finding', 0.095), ('algorithms', 0.095), ('states', 0.094), ('base', 0.094), ('partly', 0.091), ('decision', 0.089), ('conditional', 0.089), ('predictive', 0.089), ('bayesian', 0.087), ('descent', 0.086), ('often', 0.086), ('existence', 0.084), ('similarity', 0.084), ('alone', 0.084), ('parametric', 0.084), ('sometimes', 0.082), ('limited', 0.082), ('learning', 0.081), ('specification', 0.081), ('correctly', 0.081), ('problems', 0.08), ('predictions', 0.079), ('length', 0.078), ('forget', 0.078), ('approach', 0.078), ('point', 0.078), ('importantly', 0.076), ('find', 0.075), ('number', 0.075), ('gradient', 0.075), ('reductions', 0.073)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

2 0.24044935 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

3 0.23978557 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

4 0.19656752 347 hunch net-2009-03-26-Machine Learning is too easy

Introduction: One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding, the others. There are two fundamental reasons why this is possible. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. There is nothing like a unifying problem defining the field. In many other areas there

5 0.1934406 237 hunch net-2007-04-02-Contextual Scaling

Introduction: Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following: Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information. First context. A clever person realizes that some bit of information x_1 could be helpful. If x_1 is discrete, they condition on it and make a predictor h(x_1), typically by counting. If they are clever, then they also do some smoothing. If x_1 is some real valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand. Second. Another clever person (or perhaps the s

6 0.19090711 388 hunch net-2010-01-24-Specializations of the Master Problem

7 0.18369478 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

8 0.18301791 332 hunch net-2008-12-23-Use of Learning Theory

9 0.17615095 12 hunch net-2005-02-03-Learning Theory, by assumption

10 0.17478926 183 hunch net-2006-06-14-Explorations of Exploration

11 0.17215139 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

12 0.17052461 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

13 0.16964935 9 hunch net-2005-02-01-Watchword: Loss

14 0.16847232 95 hunch net-2005-07-14-What Learning Theory might do

15 0.16664104 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

16 0.16589342 351 hunch net-2009-05-02-Wielding a New Abstraction

17 0.16577521 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

18 0.16524903 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

19 0.16065449 258 hunch net-2007-08-12-Exponentiated Gradient

20 0.15641734 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.402), (1, 0.208), (2, 0.011), (3, -0.025), (4, 0.01), (5, -0.037), (6, -0.049), (7, 0.068), (8, 0.162), (9, -0.028), (10, 0.022), (11, -0.107), (12, 0.016), (13, 0.019), (14, 0.083), (15, -0.048), (16, 0.023), (17, -0.08), (18, 0.012), (19, -0.106), (20, 0.008), (21, -0.029), (22, 0.001), (23, -0.041), (24, -0.071), (25, 0.026), (26, -0.014), (27, 0.046), (28, 0.087), (29, 0.018), (30, -0.031), (31, -0.056), (32, -0.053), (33, 0.015), (34, 0.014), (35, -0.027), (36, -0.07), (37, -0.108), (38, -0.003), (39, 0.017), (40, -0.041), (41, -0.026), (42, 0.041), (43, -0.054), (44, 0.007), (45, 0.023), (46, -0.01), (47, -0.035), (48, 0.035), (49, -0.036)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9601053 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

2 0.80106395 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

3 0.79403758 237 hunch net-2007-04-02-Contextual Scaling

Introduction: Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following: Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information. First context. A clever person realizes that some bit of information x_1 could be helpful. If x_1 is discrete, they condition on it and make a predictor h(x_1), typically by counting. If they are clever, then they also do some smoothing. If x_1 is some real valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand. Second. Another clever person (or perhaps the s

4 0.77445316 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

Introduction: I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of: Specify a prior over world models. Integrate using Bayes law with respect to all observed information to compute a posterior over world models. Predict according to the posterior. Bayesian learning has many advantages over other learning programs: Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe

5 0.74538589 95 hunch net-2005-07-14-What Learning Theory might do

Introduction: I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides. Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that: Explains the basic intuitions. Makes new testable predictions about how to learn. Succeeds in so learning. This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems. Where next for learning theory? I like the a

6 0.74107486 347 hunch net-2009-03-26-Machine Learning is too easy

7 0.7335096 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

8 0.71407902 253 hunch net-2007-07-06-Idempotent-capable Predictors

9 0.70817512 160 hunch net-2006-03-02-Why do people count for learning?

10 0.69996238 351 hunch net-2009-05-02-Wielding a New Abstraction

11 0.67851335 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

12 0.67528605 217 hunch net-2006-11-06-Data Linkage Problems

13 0.67435676 28 hunch net-2005-02-25-Problem: Online Learning

14 0.66782969 332 hunch net-2008-12-23-Use of Learning Theory

15 0.66758025 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

16 0.66465789 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

17 0.65548497 12 hunch net-2005-02-03-Learning Theory, by assumption

18 0.64442676 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

19 0.63892603 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

20 0.63484794 286 hunch net-2008-01-25-Turing’s Club for Machine Learning


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.011), (3, 0.02), (10, 0.023), (27, 0.253), (30, 0.013), (38, 0.075), (49, 0.015), (51, 0.138), (53, 0.082), (55, 0.071), (77, 0.047), (93, 0.016), (94, 0.094), (95, 0.07)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96973783 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

Introduction: The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page). Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire. I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. Comparing wit

2 0.96584868 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

Introduction: Several talks seem potentially interesting to ML folks at this year’s SODA. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta, Approximate Clustering without the Approximation. This paper gives reasonable algorithms with provable approximation guarantees for k-median and other notions of clustering. It’s conceptually interesting, because it’s the second example I’ve seen where NP hardness is subverted by changing the problem definition in a subtle but reasonable way. Essentially, they show that if any near-approximation to an optimal solution is good, then it’s computationally easy to find a near-optimal solution. This subtle shift bears serious thought. A similar one occurred in our ranking paper with respect to minimum feedback arcset. With two known examples, it suggests that many more NP-complete problems might be finessed into irrelevance in this style. Yury Lifshits and Shengyu Zhang, Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates, and Smal

3 0.95590264 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

Introduction: The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured. From the instructions page, several issues come up: Large Definition My personal definition of dataset size is: small A dataset is small if a human could look at the dataset and plausibly find a good solution. medium A dataset is medium-sized if it fits in the RAM of a reasonably priced computer. large A large dataset does not fit in the RAM of a reasonably priced computer. By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that. The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require ran

same-blog 4 0.9503535 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

5 0.94451308 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms

Introduction: Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We (Percy and Jake) believe there are currently a number of shortcomings: Incomplete Disclosure: You read a paper that proposes Algorithm A which is shown to outperform SVMs on two datasets. Great. But what about on other datasets? How sensitive is this result? What about compute time – does the algorithm take two seconds on a laptop or two weeks on a 100-node cluster? Lack of Standardization: Algorithm A beats Algorithm B on one version of a dataset. Algorithm B beats Algorithm A on another version yet uses slightly different preprocessing. Though doing a head-on comparison would be ideal, it would be tedious since the programs probably use different dataset formats and have a large array of options

6 0.89421505 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

7 0.89337003 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

8 0.89301366 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

9 0.89237529 259 hunch net-2007-08-19-Choice of Metrics

10 0.88891047 12 hunch net-2005-02-03-Learning Theory, by assumption

11 0.88846844 258 hunch net-2007-08-12-Exponentiated Gradient

12 0.88701785 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

13 0.88421804 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

14 0.88363975 360 hunch net-2009-06-15-In Active Learning, the question changes

15 0.88350815 351 hunch net-2009-05-02-Wielding a New Abstraction

16 0.88220042 95 hunch net-2005-07-14-What Learning Theory might do

17 0.88201451 131 hunch net-2005-11-16-The Everything Ensemble Edge

18 0.88064635 143 hunch net-2005-12-27-Automated Labeling

19 0.87881011 19 hunch net-2005-02-14-Clever Methods of Overfitting

20 0.87850142 406 hunch net-2010-08-22-KDD 2010