hunch_net hunch_net-2008 hunch_net-2008-303 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding: N = E x ~ D f(x) given samples from D and knowledge of f . Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q . In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E x ~ Q f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1 , which is typically substantially better than the sample complexity of the orig
sentIndex sentText sentNum sentScore
1 This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. [sent-1, score-0.482]
2 The basic trick has to do with importance weighting for monte carlo integration. [sent-2, score-1.289]
3 Consider the problem of finding: N = E x ~ D f(x) given samples from D and knowledge of f . [sent-3, score-0.581]
4 Often, we don’t have samples from D available. [sent-4, score-0.34]
5 Instead, we must make do with samples from some other distribution Q . [sent-5, score-0.412]
6 In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E x ~ Q f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? [sent-6, score-1.584]
7 In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. [sent-7, score-0.484]
8 Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1 , which is typically substantially better than the sample complexity of the original problem. [sent-8, score-1.004]
9 This observation underlies the motivation for voluntary importance weighting algorithms. [sent-9, score-1.03]
10 Even under pretty terrible approximations, the logic of “ Q(x) is something like f(x) D(x) ” often yields substantial improvements over sampling directly from D(x) . [sent-10, score-0.814]
wordName wordTfidf (topN-words)
[('weighting', 0.384), ('samples', 0.34), ('importance', 0.286), ('trick', 0.231), ('bounded', 0.217), ('voluntary', 0.171), ('dale', 0.171), ('schuurmans', 0.171), ('nonzero', 0.158), ('sample', 0.156), ('carlo', 0.149), ('monte', 0.149), ('formula', 0.137), ('approximations', 0.137), ('terrible', 0.132), ('complexity', 0.132), ('precision', 0.128), ('logic', 0.115), ('repeatedly', 0.111), ('motivation', 0.111), ('convergence', 0.108), ('special', 0.108), ('yields', 0.108), ('often', 0.106), ('original', 0.103), ('sampling', 0.103), ('estimate', 0.101), ('improvements', 0.095), ('given', 0.095), ('basic', 0.09), ('turns', 0.086), ('knowledge', 0.084), ('finding', 0.078), ('observation', 0.078), ('required', 0.078), ('pretty', 0.078), ('directly', 0.077), ('learned', 0.077), ('nevertheless', 0.072), ('distribution', 0.072), ('still', 0.07), ('substantially', 0.07), ('order', 0.068), ('instead', 0.067), ('rate', 0.064), ('post', 0.063), ('problem', 0.062), ('value', 0.061), ('long', 0.061), ('case', 0.061)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999994 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
Introduction: This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding: N = E x ~ D f(x) given samples from D and knowledge of f . Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q . In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E x ~ Q f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1 , which is typically substantially better than the sample complexity of the orig
2 0.11430052 33 hunch net-2005-02-28-Regularization
Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l 1 or l 2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failiure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s
3 0.10415353 183 hunch net-2006-06-14-Explorations of Exploration
Introduction: Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models. Reinforcement Learning (1) . Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a . The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E 3 by Satinder Singh and Michael Kearns . Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E 3 and
4 0.10065021 235 hunch net-2007-03-03-All Models of Learning have Flaws
Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c
5 0.098429978 251 hunch net-2007-06-24-Interesting Papers at ICML 2007
Introduction: Here are a few of the papers I enjoyed at ICML. Steffen Bickel , Michael Brüeckner, Tobias Scheffer , Discriminative Learning for Differing Training and Test Distributions There is a nice trick in this paper: they predict the probability that an unlabeled sample is in the training set vs. the test set, and then use this prediction to importance weight labeled samples in the training set. This paper uses a specific parametric model, but the approach is easily generalized. Steve Hanneke A Bound on the Label Complexity of Agnostic Active Learning This paper bounds the number of labels required by the A 2 algorithm for active learning in the agnostic case. Last year we figured out agnostic active learning was possible. This year, it’s quantified. Hopefull soon, it will be practical. Sylvian Gelly , David Silver Combining Online and Offline Knowledge in UCT . This paper is about techniques for improving MoGo with various sorts of learning. MoGo has a fair
6 0.097199038 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous
7 0.095500603 43 hunch net-2005-03-18-Binomial Weighting
8 0.095207959 45 hunch net-2005-03-22-Active learning
9 0.094988473 256 hunch net-2007-07-20-Motivation should be the Responsibility of the Reviewer
10 0.093433246 28 hunch net-2005-02-25-Problem: Online Learning
11 0.092878126 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
12 0.092453055 351 hunch net-2009-05-02-Wielding a New Abstraction
13 0.090470031 98 hunch net-2005-07-27-Not goal metrics
14 0.083494075 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
15 0.075436428 160 hunch net-2006-03-02-Why do people count for learning?
16 0.075436346 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
17 0.074910313 14 hunch net-2005-02-07-The State of the Reduction
18 0.074480414 432 hunch net-2011-04-20-The End of the Beginning of Active Learning
19 0.072248876 120 hunch net-2005-10-10-Predictive Search is Coming
20 0.071817815 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
topicId topicWeight
[(0, 0.153), (1, 0.083), (2, 0.029), (3, 0.01), (4, 0.013), (5, -0.033), (6, 0.05), (7, 0.015), (8, 0.014), (9, -0.014), (10, 0.014), (11, 0.0), (12, 0.007), (13, 0.006), (14, -0.007), (15, -0.034), (16, -0.06), (17, 0.012), (18, -0.009), (19, 0.047), (20, -0.022), (21, 0.092), (22, 0.039), (23, 0.045), (24, 0.015), (25, -0.005), (26, -0.039), (27, -0.02), (28, 0.015), (29, -0.07), (30, -0.002), (31, -0.034), (32, 0.054), (33, 0.069), (34, -0.04), (35, -0.054), (36, 0.062), (37, 0.041), (38, -0.078), (39, 0.044), (40, -0.053), (41, -0.002), (42, 0.075), (43, -0.027), (44, 0.021), (45, 0.061), (46, -0.09), (47, -0.014), (48, -0.023), (49, 0.05)]
simIndex simValue blogId blogTitle
same-blog 1 0.97546071 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
Introduction: This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding: N = E x ~ D f(x) given samples from D and knowledge of f . Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q . In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E x ~ Q f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1 , which is typically substantially better than the sample complexity of the orig
2 0.73730153 33 hunch net-2005-02-28-Regularization
Introduction: Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints. Functionally . Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l 1 or l 2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. Statistical Learning Viewpoint Regularization is about the failiure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S . There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S , e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a s
3 0.63753694 123 hunch net-2005-10-16-Complexity: It’s all in your head
Introduction: One of the central concerns of learning is to understand and to prevent overfitting. Various notion of “function complexity” often arise: VC dimension, Rademacher complexity, comparison classes of experts, and program length are just a few. The term “complexity” to me seems somehow misleading; the terms never capture something that meets my intuitive notion of complexity. The Bayesian notion clearly captures what’s going on. Functions aren’t “complex”– they’re just “surprising”: we assign to them low probability. Most (all?) complexity notions I know boil down to some (generally loose) bound on the prior probability of the function. In a sense, “complexity” fundementally arises because probability distributions must sum to one. You can’t believe in all possibilities at the same time, or at least not equally. Rather you have to carefully spread the probability mass over the options you’d like to consider. Large complexity classes means that beliefs are spread thinly. In
4 0.59311789 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design
Introduction: There were two papers at ICML presenting learning algorithms for a contextual bandit -style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require a exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly: The world reveals feature values (aka context information). A policy chooses an action. The world provides a reward. Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications. I’m g
5 0.5573799 351 hunch net-2009-05-02-Wielding a New Abstraction
Introduction: This post is partly meant as an advertisement for the reductions tutorial Alina , Bianca , and I are planning to do at ICML . Please come, if you are interested. Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems. In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, yo
6 0.55360121 218 hunch net-2006-11-20-Context and the calculation misperception
7 0.54604799 45 hunch net-2005-03-22-Active learning
8 0.53839523 185 hunch net-2006-06-16-Regularization = Robustness
9 0.52584773 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem
10 0.52346665 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models
11 0.51967561 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time
12 0.50583887 98 hunch net-2005-07-27-Not goal metrics
13 0.50466639 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”
14 0.48958498 35 hunch net-2005-03-04-The Big O and Constants in Learning
15 0.48904294 39 hunch net-2005-03-10-Breaking Abstractions
16 0.4886685 67 hunch net-2005-05-06-Don’t mix the solution into the problem
17 0.47985625 231 hunch net-2007-02-10-Best Practices for Collaboration
18 0.47804126 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms
19 0.47296366 176 hunch net-2006-05-01-A conversation between Theo and Pat
20 0.46129265 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem
topicId topicWeight
[(27, 0.319), (48, 0.406), (53, 0.017), (55, 0.08), (94, 0.056)]
simIndex simValue blogId blogTitle
1 0.92494673 232 hunch net-2007-02-11-24
Introduction: To commemorate the Twenty Fourth Annual International Conference on Machine Learning (ICML-07), the FOX Network has decided to launch a new spin-off series in prime time. Through unofficial sources, I have obtained the story arc for the first season, which appears frighteningly realistic.
same-blog 2 0.91400194 303 hunch net-2008-06-09-The Minimum Sample Complexity of Importance Weighting
Introduction: This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time. The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding: N = E x ~ D f(x) given samples from D and knowledge of f . Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q . In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula: E x ~ Q f(x) D(x)/Q(x) A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions. Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1 , which is typically substantially better than the sample complexity of the orig
3 0.85680437 468 hunch net-2012-06-29-ICML survey and comments
Introduction: Just about nothing could keep me from attending ICML , except for Dora who arrived on Monday. Consequently, I have only secondhand reports that the conference is going well. For those who are remote (like me) or after the conference (like everyone), Mark Reid has setup the ICML discussion site where you can comment on any paper or subscribe to papers. Authors are automatically subscribed to their own papers, so it should be possible to have a discussion significantly after the fact, as people desire. We also conducted a survey before the conference and have the survey results now. This can be compared with the ICML 2010 survey results . Looking at the comparable questions, we can sometimes order the answers to have scores ranging from 0 to 3 or 0 to 4 with 3 or 4 being best and 0 worst, then compute the average difference between 2012 and 2010. Glancing through them, I see: Most people found the papers they reviewed a good fit for their expertise (-.037 w.r.t 20
4 0.8440606 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression
Introduction: Dean Foster and Daniel Hsu had a couple observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11 ). Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs other classes, yet they are not comfortable with more complex error correcting codes because they fear that they create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes. In particular, If there exists a set of weight vectors w i such that P(i|x)=, then for any invertible error correcting output code C , there exists weight vectors w c which decode to perfectly predict the probability of each class. The proof is simple and constructive: the weight vector w c can be constructed acc
5 0.83523911 318 hunch net-2008-09-26-The SODA Program Committee
Introduction: Claire asked me to be on the SODA program committee this year, which was quite a bit of work. I had a relatively light load—merely 49 theory papers. Many of these papers were not on subjects that I was expert about, so (as is common for theory conferences) I found various reviewers that I trusted to help review the papers. I ended up reviewing about 1/3 personally. There were a couple instances where I ended up overruling a subreviewer whose logic seemed off, but otherwise I generally let their reviews stand. There are some differences in standards for paper reviews between the machine learning and theory communities. In machine learning it is expected that a review be detailed, while in the theory community this is often not the case. Every paper given to me ended up with a review varying between somewhat and very detailed. I’m sure not every author was happy with the outcome. While we did our best to make good decisions, they were difficult decisions to make. For exam
6 0.79352099 445 hunch net-2011-09-28-Somebody’s Eating Your Lunch
7 0.68753117 46 hunch net-2005-03-24-The Role of Workshops
8 0.6512633 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive
9 0.64045799 466 hunch net-2012-06-05-ICML acceptance statistics
10 0.63614017 259 hunch net-2007-08-19-Choice of Metrics
11 0.6344468 378 hunch net-2009-11-15-The Other Online Learning
12 0.63178408 304 hunch net-2008-06-27-Reviewing Horror Stories
13 0.62993211 325 hunch net-2008-11-10-ICML Reviewing Criteria
14 0.62719256 352 hunch net-2009-05-06-Machine Learning to AI
15 0.62649113 449 hunch net-2011-11-26-Giving Thanks
16 0.62316757 432 hunch net-2011-04-20-The End of the Beginning of Active Learning
17 0.62221277 45 hunch net-2005-03-22-Active learning
18 0.62217939 41 hunch net-2005-03-15-The State of Tight Bounds
19 0.62137622 293 hunch net-2008-03-23-Interactive Machine Learning
20 0.62114602 288 hunch net-2008-02-10-Complexity Illness