hunch_net hunch_net-2005 hunch_net-2005-123 knowledge-graph by maker-knowledge-mining

123 hunch net-2005-10-16-Complexity: It’s all in your head


meta info for this blog

Source: html

Introduction: One of the central concerns of learning is to understand and to prevent overfitting. Various notions of “function complexity” often arise: VC dimension, Rademacher complexity, comparison classes of experts, and program length are just a few. The term “complexity” to me seems somehow misleading; the terms never capture something that meets my intuitive notion of complexity. The Bayesian notion clearly captures what’s going on. Functions aren’t “complex”; they’re just “surprising”: we assign to them low probability. Most (all?) complexity notions I know boil down to some (generally loose) bound on the prior probability of the function. In a sense, “complexity” fundamentally arises because probability distributions must sum to one. You can’t believe in all possibilities at the same time, or at least not equally. Rather you have to carefully spread the probability mass over the options you’d like to consider. Large complexity classes mean that beliefs are spread thinly. In


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One of the central concerns of learning is to understand and to prevent overfitting. [sent-1, score-0.266]

2 Various notions of “function complexity” often arise: VC dimension, Rademacher complexity, comparison classes of experts, and program length are just a few. [sent-2, score-0.658]

3 The term “complexity” to me seems somehow misleading; the terms never capture something that meets my intuitive notion of complexity. [sent-3, score-0.63]

4 The Bayesian notion clearly captures what’s going on. [sent-4, score-0.37]

5 Functions aren’t “complex”– they’re just “surprising”: we assign to them low probability. [sent-5, score-0.094]

6 Most (all?) complexity notions I know boil down to some (generally loose) bound on the prior probability of the function. [sent-7, score-1.139]

7 In a sense, “complexity” fundamentally arises because probability distributions must sum to one. [sent-8, score-0.788]

8 You can’t believe in all possibilities at the same time, or at least not equally. [sent-9, score-0.084]

9 Rather you have to carefully spread the probability mass over the options you’d like to consider. [sent-10, score-0.591]

10 Large complexity classes mean that beliefs are spread thinly. [sent-11, score-0.774]

11 In its simplest form, this phenomenon gives the log(1/n) for n hypotheses in classic PAC bounds. [sent-12, score-0.292]
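The log term in the sentence above can be made concrete with the classic finite-class PAC bound (a standard textbook form, assumed here rather than quoted from the post):

```latex
% Uniform prior p(h) = 1/n over n hypotheses; m i.i.d. samples.
% Union bound over hypotheses that fit the sample but are bad:
\Pr\!\left[\exists h \in H : \hat{e}(h) = 0 \;\wedge\; e(h) > \epsilon\right]
  \;\le\; n\, e^{-\epsilon m} \;\le\; \delta
% Solving for \epsilon: with probability at least 1 - \delta,
e(h) \;\le\; \frac{\ln n + \ln(1/\delta)}{m}
% The \ln n term equals \ln\big(1/p(h)\big) for the uniform prior
% p(h) = 1/n: the price of spreading prior mass over n options.
```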

12 In fact, one way to think about good learning algorithms is that they are those which take full advantage of their probability mass. [sent-13, score-0.281]

13 In the language of Minimum Description Length, they correspond to “non-defective distributions”. [sent-14, score-0.094]
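The Minimum Description Length correspondence in the sentence above can be sketched in a few lines: a prior p(h) corresponds to a code of length -log2 p(h) bits, and a "non-defective" distribution wastes no probability mass, so its Kraft sum is exactly 1. This is an illustrative helper under those standard definitions, not code from the original post.

```python
import math

def codelengths_from_prior(prior):
    """Map a prior over hypotheses to MDL description lengths in bits.

    Under the MDL view, a prior p(h) corresponds to a code of length
    -log2 p(h); a non-defective distribution uses all its mass, so the
    Kraft sum of 2**-length over hypotheses equals 1.
    (Illustrative helper, not from the original post.)
    """
    return {h: -math.log2(p) for h, p in prior.items()}

prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
lengths = codelengths_from_prior(prior)
kraft_sum = sum(2.0 ** -l for l in lengths.values())
print(lengths["h1"], lengths["h2"])  # 1.0 2.0
print(round(kraft_sum, 10))          # 1.0
```

A "defective" distribution would have a Kraft sum strictly below 1, i.e. some codelengths longer than the prior justifies, which is exactly the wasted probability mass the post alludes to.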

14 So this raises a question: are there notions of complexity (preferably finite, computable ones) that differ fundamentally from the notions of “prior” or “surprisingness”? [sent-15, score-1.541]

15 Game-theoretic setups would seem to be promising, although much of the work I’m familiar with ties it closely to the notion of prior as well. [sent-16, score-0.625]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('complexity', 0.327), ('notions', 0.318), ('fundementally', 0.273), ('notion', 0.249), ('probability', 0.209), ('spread', 0.202), ('length', 0.181), ('prior', 0.164), ('classes', 0.154), ('distributions', 0.151), ('ties', 0.121), ('boil', 0.121), ('captures', 0.121), ('classic', 0.121), ('raises', 0.121), ('rademacher', 0.112), ('meets', 0.112), ('promising', 0.112), ('computable', 0.106), ('vc', 0.101), ('loose', 0.101), ('intuitive', 0.097), ('hypotheses', 0.097), ('misleading', 0.097), ('prevent', 0.097), ('correspond', 0.094), ('somehow', 0.094), ('dimension', 0.094), ('options', 0.094), ('assign', 0.094), ('closely', 0.091), ('beliefs', 0.091), ('central', 0.091), ('pac', 0.088), ('arise', 0.088), ('mass', 0.086), ('possibilities', 0.084), ('surprising', 0.084), ('arises', 0.082), ('re', 0.08), ('differ', 0.078), ('concerns', 0.078), ('capture', 0.078), ('comparison', 0.074), ('description', 0.074), ('simplest', 0.074), ('sum', 0.073), ('minimum', 0.073), ('full', 0.072), ('finite', 0.071)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 123 hunch net-2005-10-16-Complexity: It’s all in your head


2 0.17226627 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s
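The generic bound in the excerpt above can be written out, together with one standard instantiation for a finite class (Hoeffding's inequality plus a union bound); the instantiation is an assumed example, not taken from the post:

```latex
% Generic form: with probability at least 1 - \delta over m
% i.i.d. samples S,
e(c,D) \;\le\; e(c,S) + f(\mathrm{complexity})
% Standard instantiation for a finite class of n classifiers:
e(c,D) \;\le\; e(c,S) + \sqrt{\frac{\ln n + \ln(1/\delta)}{2m}}
% Here f(\mathrm{complexity}) shrinks as m grows and grows with
% \ln n, the size of the class.
```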

3 0.1630365 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in
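The odds-to-probability conversion described in the excerpt above is a one-liner; this is an illustrative helper based on that Y/(Y+1) rule, not code from the original post:

```python
def belief_from_odds(y):
    """Degree of belief implied by indifference at 'Y to 1' odds.

    Per the betting interpretation above, a consistent bettor who
    switches from 'for' to 'against' at odds Y assigns probability
    Y / (Y + 1) to the event.
    (Illustrative helper, not from the original post.)
    """
    if y < 0:
        raise ValueError("odds must be non-negative")
    return y / (y + 1.0)

print(belief_from_odds(1.0))  # 0.5  (even odds)
print(belief_from_odds(3.0))  # 0.75 (3-to-1 odds)
```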

4 0.12566254 34 hunch net-2005-03-02-Prior, “Prior” and Bias

Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt

5 0.11344875 288 hunch net-2008-02-10-Complexity Illness

Introduction: One of the enduring stereotypes of academia is that people spend a great deal of intelligence, time, and effort finding complexity rather than simplicity. This is at least anecdotally true in my experience. Math++ Several people have found that adding useless math makes their paper more publishable as evidenced by a reject-add-accept sequence. 8 page minimum Who submitted a paper to ICML violating the 8 page minimum? Every author fears that the reviewers won’t take their work seriously unless the allowed length is fully used. The best minimum violation I know is Adam’s paper at SODA on generating random factored numbers, but this is deeply exceptional. It’s a fair bet that 90% of papers submitted are exactly at the page limit. We could imagine that this is because papers naturally take more space, but few people seem to be clamoring for more space. Journalong Has anyone been asked to review a 100 page journal paper? I have. Journal papers can be nice, becaus

6 0.1076713 235 hunch net-2007-03-03-All Models of Learning have Flaws

7 0.094669417 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

8 0.08609999 351 hunch net-2009-05-02-Wielding a New Abstraction

9 0.084840328 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

10 0.082751662 445 hunch net-2011-09-28-Somebody’s Eating Your Lunch

11 0.081803173 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning

12 0.079988249 133 hunch net-2005-11-28-A question of quantification

13 0.077608705 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

14 0.077574931 165 hunch net-2006-03-23-The Approximation Argument

15 0.077412575 12 hunch net-2005-02-03-Learning Theory, by assumption

16 0.07722237 95 hunch net-2005-07-14-What Learning Theory might do

17 0.076653667 49 hunch net-2005-03-30-What can Type Theory teach us about Machine Learning?

18 0.07563711 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

19 0.075045682 14 hunch net-2005-02-07-The State of the Reduction

20 0.074964769 325 hunch net-2008-11-10-ICML Reviewing Criteria


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.156), (1, 0.085), (2, 0.039), (3, -0.006), (4, -0.017), (5, -0.02), (6, 0.017), (7, 0.056), (8, 0.096), (9, -0.036), (10, 0.038), (11, -0.04), (12, 0.052), (13, -0.041), (14, 0.07), (15, -0.115), (16, -0.033), (17, 0.001), (18, 0.093), (19, -0.018), (20, 0.01), (21, 0.098), (22, 0.047), (23, 0.053), (24, -0.016), (25, -0.0), (26, 0.012), (27, -0.029), (28, -0.034), (29, 0.098), (30, 0.069), (31, -0.011), (32, 0.059), (33, 0.041), (34, -0.1), (35, -0.068), (36, 0.042), (37, 0.038), (38, 0.041), (39, 0.055), (40, -0.066), (41, -0.02), (42, 0.057), (43, 0.006), (44, -0.003), (45, 0.048), (46, -0.028), (47, 0.01), (48, -0.077), (49, 0.135)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98533022 123 hunch net-2005-10-16-Complexity: It’s all in your head


2 0.7870363 5 hunch net-2005-01-26-Watchword: Probability

Introduction: Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used. Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y . The probability is then Y/(Y+1) . Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively. Frequentist The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in

3 0.7331689 33 hunch net-2005-02-28-Regularization

Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s

4 0.68971443 39 hunch net-2005-03-10-Breaking Abstractions

Introduction: Sam Roweis ‘s comment reminds me of a more general issue that comes up in doing research: abstractions always break. Real numbers aren’t. Most real numbers cannot be represented with any machine. One implication of this is that many real-number based algorithms have difficulties when implemented with floating point numbers. The box on your desk is not a Turing machine. A Turing machine can compute anything computable, given sufficient time. A typical computer fails terribly when the state required for the computation exceeds some limit. Nash equilibria aren’t equilibria. This comes up when trying to predict human behavior based on the result of the equilibria computation. Often, it doesn’t work. The probability isn’t. Probability is an abstraction expressing either our lack of knowledge (the Bayesian viewpoint) or fundamental randomization (the frequentist viewpoint). From the frequentist viewpoint the lack of knowledge typically precludes actually knowing the fu

5 0.59981585 34 hunch net-2005-03-02-Prior, “Prior” and Bias

Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt

6 0.59734684 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting

7 0.58081776 62 hunch net-2005-04-26-To calibrate or not?

8 0.55055481 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

9 0.5371564 118 hunch net-2005-10-07-On-line learning of regular decision rules

10 0.5143851 218 hunch net-2006-11-20-Context and the calculation misperception

11 0.51128489 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

12 0.48632535 185 hunch net-2006-06-16-Regularization = Robustness

13 0.48612261 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

14 0.48404714 140 hunch net-2005-12-14-More NIPS Papers II

15 0.47259134 330 hunch net-2008-12-07-A NIPS paper

16 0.45921981 302 hunch net-2008-05-25-Inappropriate Mathematics for Machine Learning

17 0.4583441 160 hunch net-2006-03-02-Why do people count for learning?

18 0.44900832 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound

19 0.44247499 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

20 0.43878752 165 hunch net-2006-03-23-The Approximation Argument


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.043), (10, 0.033), (27, 0.22), (38, 0.071), (53, 0.018), (55, 0.079), (69, 0.287), (94, 0.075), (95, 0.077)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.90022302 123 hunch net-2005-10-16-Complexity: It’s all in your head


2 0.66474414 360 hunch net-2009-06-15-In Active Learning, the question changes

Introduction: A little over 4 years ago, Sanjoy made a post saying roughly “we should study active learning theoretically, because not much is understood”. At the time, we did not understand basic things such as whether or not it was possible to PAC-learn with an active algorithm without making strong assumptions about the noise rate. In other words, the fundamental question was “can we do it?” The nature of the question has fundamentally changed in my mind. The answer to the previous question is “yes”, both information theoretically and computationally, most places where supervised learning could be applied. In many situations, the question has now changed to: “is it worth it?” Is the programming and computational overhead low enough to make the label cost savings of active learning worthwhile? Currently, there are situations where this question could go either way. Much of the challenge for the future is in figuring out how to make active learning easier or more worthwhile.

3 0.66415596 343 hunch net-2009-02-18-Decision by Vetocracy

Introduction: Few would mistake the process of academic paper review for a fair process, but sometimes the unfairness seems particularly striking. This is most easily seen by comparison: Paper Banditron Offset Tree Notes Problem Scope Multiclass problems where only the loss of one choice can be probed. Strictly greater: Cost sensitive multiclass problems where only the loss of one choice can be probed. Often generalizations don’t matter. That’s not the case here, since every plausible application I’ve thought of involves loss functions substantially different from 0/1. What’s new Analysis and Experiments Algorithm, Analysis, and Experiments As far as I know, the essence of the more general problem was first stated and analyzed with the EXP4 algorithm (page 16) (1998). It’s also the time horizon 1 simplification of the Reinforcement Learning setting for the random trajectory method (page 15) (2002). The Banditron algorithm itself is functionally identi

4 0.66204464 36 hunch net-2005-03-05-Funding Research

Introduction: The funding of research (and machine learning research) is an issue which seems to have become more significant in the United States over the last decade. The word “research” is applied broadly here to science, mathematics, and engineering. There are two essential difficulties with funding research: Longshot Paying a researcher is often a big gamble. Most research projects don’t pan out, but a few big payoffs can make it all worthwhile. Information Only Much of research is about finding the right way to think about or do something. The Longshot difficulty means that there is high variance in payoffs. This can be compensated for by funding many different research projects, reducing variance. The Information-Only difficulty means that it’s hard to extract a profit directly from many types of research, so companies have difficulty justifying basic research. (Patents are a mechanism for doing this. They are often extraordinarily clumsy or simply not applicable.) T

5 0.65833962 194 hunch net-2006-07-11-New Models

Introduction: How should we, as researchers in machine learning, organize ourselves? The most immediate measurable objective of computer science research is publishing a paper. The most difficult aspect of publishing a paper is having reviewers accept and recommend it for publication. The simplest mechanism for doing this is to show theoretical progress on some standard, well-known easily understood problem. In doing this, we often fall into a local minimum of the research process. The basic problem in machine learning is that it is very unclear that the mathematical model is the right one for the (or some) real problem. A good mathematical model in machine learning should have one fundamental trait: it should aid the design of effective learning algorithms. To date, our ability to solve interesting learning problems (speech recognition, machine translation, object recognition, etc…) remains limited (although improving), so the “rightness” of our models is in doubt. If our mathematical mod

6 0.655864 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

7 0.65528399 235 hunch net-2007-03-03-All Models of Learning have Flaws

8 0.65371758 406 hunch net-2010-08-22-KDD 2010

9 0.65360808 95 hunch net-2005-07-14-What Learning Theory might do

10 0.64950079 51 hunch net-2005-04-01-The Producer-Consumer Model of Research

11 0.64945149 320 hunch net-2008-10-14-Who is Responsible for a Bad Review?

12 0.64887959 371 hunch net-2009-09-21-Netflix finishes (and starts)

13 0.64861834 351 hunch net-2009-05-02-Wielding a New Abstraction

14 0.64856482 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

15 0.64832062 259 hunch net-2007-08-19-Choice of Metrics

16 0.64765221 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

17 0.64684385 183 hunch net-2006-06-14-Explorations of Exploration

18 0.64607155 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

19 0.64508665 12 hunch net-2005-02-03-Learning Theory, by assumption

20 0.64481235 220 hunch net-2006-11-27-Continuizing Solutions