hunch_net hunch_net-2010 hunch_net-2010-413 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Textbooks invariably seem to carry the proof that uses Markov's inequality, moment-generating functions, and Taylor approximations. Here's an easier way.
For $p, q \in (0,1)$, let $K(q,p)$ be the KL divergence between a coin of bias $q$ and one of bias $p$: $K(q,p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$.
Theorem: Suppose you do $n$ independent tosses of a coin of bias $p$. The probability of seeing $qn$ heads or more, for $q > p$, is at most $\exp(-nK(q,p))$. So is the probability of seeing $qn$ heads or less, for $q < p$.
Remark: By Pinsker's inequality, $K(q,p) \geq 2(q-p)^2$.
Proof: Let's do the $q > p$ case; the other is identical. Let $Q$ be the distribution over $\{0,1\}^n$ induced by a coin of bias $q$, and likewise $P$ for a coin of bias $p$. Let $S$ be the set of all sequences of $n$ tosses which contain $qn$ heads or more. We'd like to show that $S$ is unlikely under $P$. Pick any $\bar{s} \in S$, with say $k \geq qn$ heads. Then, using $k \geq qn$ together with $p/q < 1$ and $(1-p)/(1-q) > 1$:
$\frac{P(\bar{s})}{Q(\bar{s})} = \left(\frac{p}{q}\right)^{k}\left(\frac{1-p}{1-q}\right)^{n-k} \leq \left(\frac{p}{q}\right)^{qn}\left(\frac{1-p}{1-q}\right)^{(1-q)n} = e^{-nK(q,p)}.$
Since $P(\bar{s}) \leq Q(\bar{s})\, e^{-nK(q,p)}$ for every $\bar{s} \in S$, we have $P(S) \leq Q(S)\, e^{-nK(q,p)} \leq e^{-nK(q,p)}$, and we're done.
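As a sanity check on the statement above, here is a short numerical sketch (not from the original post; plain standard-library Python with illustrative parameter values) that compares the exact binomial tail against $\exp(-nK(q,p))$ and against the weaker $\exp(-2n(q-p)^2)$ bound obtained via Pinsker's inequality.

```python
# Numerical check of the bound P(#heads >= q*n) <= exp(-n*K(q,p)).
# Pure standard-library Python; parameter values are illustrative.
import math

def kl_bernoulli(q, p):
    """K(q, p): KL divergence between a coin of bias q and a coin of bias p."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

n, p, q = 1000, 0.5, 0.55                 # n tosses of a p-coin, threshold bias q > p
k0 = round(q * n)                         # threshold count q*n (here 550)

# Exact upper tail of Binomial(n, p): sum of C(n, k) p^k (1-p)^(n-k) over k >= k0.
exact_tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

kl_bound = math.exp(-n * kl_bernoulli(q, p))        # the bound proved above
hoeffding = math.exp(-2 * n * (q - p) ** 2)         # weaker bound via Pinsker

print(f"exact tail      = {exact_tail:.3e}")        # ~8.7e-04
print(f"exp(-n K(q,p))  = {kl_bound:.3e}")          # ~6.7e-03
print(f"Hoeffding bound = {hoeffding:.3e}")         # ~6.7e-03
assert exact_tail <= kl_bound <= hoeffding
```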
simIndex simValue blogId blogTitle
same-blog 1 1.0 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound
2 0.13040477 90 hunch net-2005-07-07-The Limits of Learning Theory
Introduction: Suppose we had an infinitely powerful mathematician sitting in a room and proving theorems about learning. Could he solve machine learning? The answer is “no”. This answer is both obvious and sometimes underappreciated. There are several ways to conclude that some bias is necessary in order to succesfully learn. For example, suppose we are trying to solve classification. At prediction time, we observe some features X and want to make a prediction of either 0 or 1 . Bias is what makes us prefer one answer over the other based on past experience. In order to learn we must: Have a bias. Always predicting 0 is as likely as 1 is useless. Have the “right” bias. Predicting 1 when the answer is 0 is also not helpful. The implication of “have a bias” is that we can not design effective learning algorithms with “a uniform prior over all possibilities”. The implication of “have the ‘right’ bias” is that our mathematician fails since “right” is defined wi
3 0.10313512 104 hunch net-2005-08-22-Do you believe in induction?
Introduction: Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem” which seems worth summarizing. As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof of this is very simple: on every example a predictor must make some prediction and by symmetry over the set of sequences it will be wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings. The simplistic interpretation of this theorem which many people jump to is “machine learning is dead” since there can be no single learning algorithm which can solve all learning problems. This is the wrong way to thi
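The symmetry argument quoted in the excerpt above can be verified exhaustively for a tiny case. The sketch below (illustrative, not from either post) handles only the simplest instance of the claim, a fixed predictor: averaged over every possible labeling of a few points, its error is exactly 1/2 no matter what it predicts.

```python
# Exhaustive check of the symmetry argument for a fixed predictor on n points:
# averaged over all 2^n label sequences, any fixed rule errs exactly half the time.
from itertools import product

def average_error(predictor, n=4):
    xs = list(range(n))
    labelings = list(product([0, 1], repeat=n))          # all possible label sequences
    total = 0.0
    for labels in labelings:
        total += sum(predictor(x) != y for x, y in zip(xs, labels)) / n
    return total / len(labelings)

print(average_error(lambda x: 0))      # always predict 0     -> 0.5
print(average_error(lambda x: x % 2))  # any other fixed rule -> 0.5
```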
4 0.099200279 34 hunch net-2005-03-02-Prior, “Prior” and Bias
Introduction: Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. This only scratches the surface—there are yet more subt
5 0.06590756 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
Introduction: This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is. Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example. When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing: Pr’(label) * Product features i Pr’(feature i | label) (The Pr’ notation indicates these are estimated values.) There is nothing wrong with this method as long as (a) the prior for the sample counts is
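The prediction rule quoted above is easy to state in code. The following sketch (illustrative, not taken from the post; the names and numbers are made up) applies Pr'(label) * prod_i Pr'(feature i | label) in log space, the usual way to avoid underflow when many probabilities are multiplied.

```python
# Sketch of the Naive Bayes prediction rule described above:
#   argmax over labels of  Pr'(label) * prod_i Pr'(feature_i | label).
# Note: a single conditional probability estimated as exactly 0 drives the whole
# product to 0 for that label (here, log(0) raises), which is the danger the post
# discusses; smoothing the counts avoids it.
from math import log

def nb_predict(x, prior, cond):
    """x: tuple of binary features; prior[y] = Pr'(y); cond[y][i] = Pr'(x_i = 1 | y)."""
    scores = {}
    for y in prior:
        s = log(prior[y])                      # sum logs instead of multiplying
        for i, xi in enumerate(x):
            p = cond[y][i]
            s += log(p) if xi else log(1.0 - p)
        scores[y] = s
    return max(scores, key=scores.get)

# Hypothetical estimated values, just to exercise the rule.
prior = {0: 0.5, 1: 0.5}
cond = {0: [0.9, 0.2], 1: [0.3, 0.8]}
print(nb_predict((1, 1), prior, cond))         # -> 1
```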
6 0.065905735 26 hunch net-2005-02-21-Problem: Cross Validation
7 0.065768585 8 hunch net-2005-02-01-NIPS: Online Bayes
8 0.065442502 187 hunch net-2006-06-25-Presentation of Proofs is Hard.
9 0.056388721 7 hunch net-2005-01-31-Watchword: Assumption
10 0.050944664 5 hunch net-2005-01-26-Watchword: Probability
11 0.05080907 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
12 0.05025686 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning
13 0.049068186 334 hunch net-2009-01-07-Interesting Papers at SODA 2009
14 0.049044903 133 hunch net-2005-11-28-A question of quantification
15 0.046316426 16 hunch net-2005-02-09-Intuitions from applied learning
16 0.045940615 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets
17 0.044942077 258 hunch net-2007-08-12-Exponentiated Gradient
18 0.043656532 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?
19 0.042312156 485 hunch net-2013-06-29-The Benefits of Double-Blind Review
20 0.04163909 409 hunch net-2010-09-13-AIStats
simIndex simValue blogId blogTitle
same-blog 1 0.99048465 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound
2 0.52211827 34 hunch net-2005-03-02-Prior, “Prior” and Bias
3 0.50787073 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
4 0.50594842 62 hunch net-2005-04-26-To calibrate or not?
Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b
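The "softened" definition in the excerpt above suggests a simple empirical check. Below is a sketch (illustrative, not from the post) that buckets predictions into small neighborhoods of [0, 1] and compares each bucket's observed frequency of 1s to its average predicted probability.

```python
# Binned calibration check: for predictions falling in each small interval of [0, 1],
# compare the average predicted probability to the observed frequency of outcome 1.
def calibration_table(preds, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)      # index of the neighborhood containing p
        bins[i].append((p, y))
    rows = []
    for i, cell in enumerate(bins):
        if cell:
            mean_pred = sum(p for p, _ in cell) / len(cell)
            freq_ones = sum(y for _, y in cell) / len(cell)
            rows.append((i / n_bins, (i + 1) / n_bins, mean_pred, freq_ones, len(cell)))
    return rows  # (bin_lo, bin_hi, mean prediction, observed frequency, count)

# Toy usage with made-up predictions and outcomes:
print(calibration_table([0.1, 0.15, 0.8, 0.85, 0.9], [0, 0, 1, 1, 0]))
```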
5 0.49173436 104 hunch net-2005-08-22-Do you believe in induction?
6 0.48763195 7 hunch net-2005-01-31-Watchword: Assumption
7 0.46890059 90 hunch net-2005-07-07-The Limits of Learning Theory
8 0.46440774 133 hunch net-2005-11-28-A question of quantification
9 0.45221737 302 hunch net-2008-05-25-Inappropriate Mathematics for Machine Learning
10 0.43727225 55 hunch net-2005-04-10-Is the Goal Understanding or Prediction?
11 0.43641642 160 hunch net-2006-03-02-Why do people count for learning?
12 0.42357495 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets
13 0.41709659 57 hunch net-2005-04-16-Which Assumptions are Reasonable?
14 0.41189522 5 hunch net-2005-01-26-Watchword: Probability
15 0.3951036 187 hunch net-2006-06-25-Presentation of Proofs is Hard.
16 0.38572964 202 hunch net-2006-08-10-Precision is not accuracy
17 0.38303635 70 hunch net-2005-05-12-Math on the Web
18 0.37604076 185 hunch net-2006-06-16-Regularization = Robustness
19 0.37547934 218 hunch net-2006-11-20-Context and the calculation misperception
20 0.37189224 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
simIndex simValue blogId blogTitle
same-blog 1 0.94584155 413 hunch net-2010-10-08-An easy proof of the Chernoff-Hoeffding bound
2 0.92968917 171 hunch net-2006-04-09-Progress in Machine Translation
Introduction: I just visited ISI where Daniel Marcu and others are working on machine translation. Apparently, machine translation is rapidly improving. A particularly dramatic year was 2002->2003 when systems switched from word-based translation to phrase-based translation. From a (now famous) slide by Charles Wayne at DARPA (which funds much of the work on machine translation) here is some anecdotal evidence: 2002 2003 insistent Wednesday may recurred her trips to Libya tomorrow for flying. Cairo 6-4 ( AFP ) – An official announced today in the Egyptian lines company for flying Tuesday is a company “insistent for flying” may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment. And said the official “the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air, a situation her recieving replying are so a trip will pull to Libya a morning Wednesday.” E
3 0.74353492 17 hunch net-2005-02-10-Conferences, Dates, Locations
Introduction:
Conference   Location              Date
COLT         Bertinoro, Italy      June 27-30
AAAI         Pittsburgh, PA, USA   July 9-13
UAI          Edinburgh, Scotland   July 26-29
IJCAI        Edinburgh, Scotland   July 30 - August 5
ICML         Bonn, Germany         August 7-11
KDD          Chicago, IL, USA      August 21-24
The big winner this year is Europe. This is partly a coincidence, and partly due to the general internationalization of science over the last few years. With cuts to basic science in the US and increased hassle for visitors, conferences outside the US become more attractive. Europe and Australia/New Zealand are the immediate winners because they have the science, infrastructure, and English in place. China and India are possible future winners.
4 0.58357942 97 hunch net-2005-07-23-Interesting papers at ACL
Introduction: A recent discussion indicated that one goal of this blog might be to allow people to post comments about recent papers that they liked. I think this could potentially be very useful, especially for those with diverse interests but only finite time to read through conference proceedings. ACL 2005 recently completed, and here are four papers from that conference that I thought were either good or perhaps of interest to a machine learning audience. David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation . (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ‘s Y” -> “Y de X” in chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, whic
5 0.57871807 305 hunch net-2008-06-30-ICML has a comment system
Introduction: Mark Reid has stepped up and created a comment system for ICML papers which Greger Linden has tightly integrated. My understanding is that Mark spent quite a bit of time on the details, and there are some cool features like working latex math mode. This is an excellent chance for the ICML community to experiment with making ICML year-round, so I hope it works out. Please do consider experimenting with it.
6 0.50242537 25 hunch net-2005-02-20-At One Month
7 0.4381952 43 hunch net-2005-03-18-Binomial Weighting
8 0.22470853 160 hunch net-2006-03-02-Why do people count for learning?
9 0.220781 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)
10 0.21758455 251 hunch net-2007-06-24-Interesting Papers at ICML 2007
11 0.21569277 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions
12 0.21276566 353 hunch net-2009-05-08-Computability in Artificial Intelligence
13 0.21157792 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions
14 0.21139011 233 hunch net-2007-02-16-The Forgetting
15 0.21131918 14 hunch net-2005-02-07-The State of the Reduction
16 0.20993428 202 hunch net-2006-08-10-Precision is not accuracy
17 0.20864658 133 hunch net-2005-11-28-A question of quantification
18 0.2060539 259 hunch net-2007-08-19-Choice of Metrics
19 0.2039084 170 hunch net-2006-04-06-Bounds greater than 1
20 0.20371997 143 hunch net-2005-12-27-Automated Labeling