hunch_net hunch_net-2010 hunch_net-2010-392 knowledge-graph by maker-knowledge-mining

392 hunch net-2010-03-26-A Variance only Deviation Bound


meta info for this blog

Source: html

Introduction: At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable. For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds are one of the core components for the foundations of machine learning theory, so developments here have the potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance. The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible. Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable. [sent-1, score-1.426]

2 For people not familiar with deviation bounds, this may be hard to appreciate. [sent-2, score-0.65]

3 Deviation bounds are one of the core components for the foundations of machine learning theory, so developments here have the potential to alter our understanding of how to learn and what is learnable. [sent-3, score-0.502]

4 My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. [sent-4, score-0.611]

5 All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. [sent-5, score-1.144]

6 This one is the first I know of with a dependence on only the variance. [sent-6, score-0.239]

7 The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. [sent-7, score-0.55]

8 Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. [sent-8, score-1.664]

9 The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible. [sent-9, score-0.871]

10 Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. [sent-10, score-0.126]

11 For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning. [sent-11, score-0.353]
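The biased, outlier-insensitive mean estimator described above can be illustrated with a median-of-means sketch. Note the hedge: median-of-means is a standard robust estimator used here only for illustration; Catoni's actual construction soft-truncates via an influence function, and every detail below is an assumption rather than his estimator.

```python
import random
import statistics

def median_of_means(xs, k=10, seed=0):
    """Biased but outlier-resistant mean estimate: split the data into k
    blocks, average each block, and return the median of the block means.
    An outlier can corrupt only the block it lands in."""
    rng = random.Random(seed)
    xs = list(xs)
    rng.shuffle(xs)
    block = max(1, len(xs) // k)
    means = [statistics.fmean(xs[i:i + block]) for i in range(0, block * k, block)]
    return statistics.median(means)

# 997 well-behaved points plus 3 huge outliers.
data = [1.0] * 997 + [1000.0] * 3
print(statistics.fmean(data))   # empirical average, dragged far from 1 by the outliers
print(median_of_means(data))    # robust estimate, essentially 1.0
```

With 3 outliers and 10 blocks, at most 3 block means are corrupted, so the median is taken over a majority of clean blocks; this is exactly the instability of the unbiased empirical average that the post points to.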


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('deviation', 0.582), ('dependence', 0.239), ('estimator', 0.233), ('biased', 0.183), ('variance', 0.169), ('variants', 0.169), ('precisely', 0.156), ('evolved', 0.126), ('developments', 0.126), ('martingale', 0.126), ('believed', 0.126), ('opens', 0.126), ('unstable', 0.116), ('bound', 0.115), ('bounds', 0.112), ('random', 0.111), ('olivier', 0.11), ('hadn', 0.11), ('unbiased', 0.101), ('bandits', 0.101), ('alter', 0.097), ('foundations', 0.097), ('specialized', 0.094), ('components', 0.094), ('suffer', 0.094), ('knew', 0.091), ('described', 0.091), ('week', 0.089), ('understanding', 0.088), ('proved', 0.087), ('range', 0.085), ('contextual', 0.085), ('didn', 0.08), ('earlier', 0.08), ('moment', 0.078), ('variable', 0.078), ('settings', 0.077), ('exponential', 0.077), ('sum', 0.075), ('depending', 0.074), ('exploration', 0.073), ('mean', 0.068), ('bias', 0.068), ('iid', 0.068), ('inherently', 0.068), ('proof', 0.068), ('familiar', 0.068), ('clearly', 0.067), ('basic', 0.066), ('average', 0.063)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 392 hunch net-2010-03-26-A Variance only Deviation Bound


2 0.18183425 244 hunch net-2007-05-09-The Missing Bound

Introduction: Sham Kakade points out that we are missing a bound. Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method we know that: If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^{0.5}) (at least in crude form). A proof is on page 9 here. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small. What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning,
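The gap between range-driven and variance-driven bounds can be made numeric with a sketch comparing a Hoeffding width against a Bernstein-style width (Bennett's bound is tighter still; the constants below follow common textbook forms and are my assumption, not taken from the post):

```python
import math

def hoeffding_width(I, m, delta):
    # Hoeffding: deviation <= I * sqrt(ln(2/delta) / (2m)) for a range of size I.
    return I * math.sqrt(math.log(2 / delta) / (2 * m))

def bernstein_width(I, var, m, delta):
    # Bernstein: deviation <= sqrt(2 var ln(2/delta) / m) + I ln(2/delta) / (3m).
    L = math.log(2 / delta)
    return math.sqrt(2 * var * L / m) + I * L / (3 * m)

m, delta, I = 10_000, 0.05, 1.0
print(hoeffding_width(I, m, delta))         # ~0.0136, driven entirely by the range
print(bernstein_width(I, 0.001, m, delta))  # ~0.001, far tighter when the variance is tiny
```

When the true variance is small, the variance term dominates and the Bernstein-style width is more than an order of magnitude tighter, which is the improvement the post is after.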

3 0.17182174 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

Introduction: This post is about contextual bandit problems where, repeatedly: The world chooses features x and rewards for each action r_1, …, r_k, then announces the features x (but not the rewards). A policy chooses an action a. The world announces the reward r_a. The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction. One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things a
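The repeated protocol above can be sketched with a toy epsilon-greedy learner. The context and reward structure below are invented purely for illustration; real algorithms for this setting (EXP4, epoch-greedy, the offset tree) are considerably more involved.

```python
import random

def run_epsilon_greedy(T=5000, k=4, epsilon=0.1, seed=0):
    """Toy contextual bandit: the context x is 0 or 1, and the best action
    happens to equal x. The learner keeps per-(context, action) reward
    averages and plays greedily except with probability epsilon."""
    rng = random.Random(seed)
    counts = [[0] * k for _ in range(2)]
    sums = [[0.0] * k for _ in range(2)]
    total = 0.0
    for _ in range(T):
        x = rng.randrange(2)            # the world announces features x
        if rng.random() < epsilon:
            a = rng.randrange(k)        # explore uniformly
        else:
            a = max(range(k),
                    key=lambda j: sums[x][j] / counts[x][j] if counts[x][j] else 0.0)
        # Only the reward of the chosen action is revealed.
        r = 1.0 if a == x else 0.3 * rng.random()
        counts[x][a] += 1
        sums[x][a] += r
        total += r
    return total / T

print(run_epsilon_greedy())  # average reward well above pure exploration's ~0.36
```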

4 0.1129866 269 hunch net-2007-10-24-Contextual Bandits

Introduction: One of the fundamental underpinnings of the internet is advertising-based content. This has become much more effective due to targeted advertising, where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way. The problem of matching ads to interests is a natural machine learning problem in some ways, since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular, a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem. A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round: A policy chooses arm a from 1 of k arms (i.e. 1 of k ads). The world reveals t
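The round-by-round k-Armed Bandit loop can be sketched with UCB1, a standard algorithm for this setting; the payoff probabilities below are made up for illustration.

```python
import math
import random

def ucb1(pay_probs, T=20_000, seed=1):
    """UCB1 on a k-armed Bernoulli bandit: each round, pull the arm with the
    highest empirical mean plus exploration bonus sqrt(2 ln t / n_a).
    Returns the fraction of pulls that went to the best arm."""
    rng = random.Random(seed)
    k = len(pay_probs)
    n = [0] * k      # pulls per arm
    s = [0.0] * k    # total reward per arm
    best = max(range(k), key=lambda a: pay_probs[a])
    best_pulls = 0
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1    # initialize: pull each arm once
        else:
            a = max(range(k),
                    key=lambda j: s[j] / n[j] + math.sqrt(2 * math.log(t) / n[j]))
        r = 1.0 if rng.random() < pay_probs[a] else 0.0
        n[a] += 1
        s[a] += r
        best_pulls += (a == best)
    return best_pulls / T

print(ucb1([0.2, 0.5, 0.8]))  # most pulls concentrate on the 0.8 arm
```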

5 0.11131445 41 hunch net-2005-03-15-The State of Tight Bounds

Introduction: What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are called confidence intervals. Why? Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms. What We Know Now There are several families of bounds, based on how information is used. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation also here and here, train and test bounds, and cross-validation (but see the big open problem). These tec

6 0.098124109 466 hunch net-2012-06-05-ICML acceptance statistics

7 0.094786935 388 hunch net-2010-01-24-Specializations of the Master Problem

8 0.080843166 55 hunch net-2005-04-10-Is the Goal Understanding or Prediction?

9 0.07389345 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

10 0.073149413 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

11 0.073043078 293 hunch net-2008-03-23-Interactive Machine Learning

12 0.072686732 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

13 0.071779765 83 hunch net-2005-06-18-Lower Bounds for Learning Reductions

14 0.070713542 432 hunch net-2011-04-20-The End of the Beginning of Active Learning

15 0.070563152 332 hunch net-2008-12-23-Use of Learning Theory

16 0.068724796 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

17 0.067445114 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

18 0.066856191 137 hunch net-2005-12-09-Machine Learning Thoughts

19 0.066759869 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

20 0.065713242 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.136), (1, 0.054), (2, -0.004), (3, -0.004), (4, 0.056), (5, -0.033), (6, 0.057), (7, -0.025), (8, 0.022), (9, -0.002), (10, 0.105), (11, 0.054), (12, 0.052), (13, 0.041), (14, -0.055), (15, -0.0), (16, -0.036), (17, 0.027), (18, -0.065), (19, 0.074), (20, 0.012), (21, 0.033), (22, -0.112), (23, -0.001), (24, -0.081), (25, 0.036), (26, -0.061), (27, -0.015), (28, -0.027), (29, -0.014), (30, -0.071), (31, -0.076), (32, 0.035), (33, 0.035), (34, -0.054), (35, -0.029), (36, -0.068), (37, -0.023), (38, -0.044), (39, -0.004), (40, -0.086), (41, -0.063), (42, -0.108), (43, -0.005), (44, -0.067), (45, -0.036), (46, 0.088), (47, 0.032), (48, -0.059), (49, -0.011)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95374113 392 hunch net-2010-03-26-A Variance only Deviation Bound


2 0.76586145 244 hunch net-2007-05-09-The Missing Bound


3 0.66912264 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?


4 0.59601015 55 hunch net-2005-04-10-Is the Goal Understanding or Prediction?

Introduction: Steve Smale and I have a debate about goals of learning theory. Steve likes theorems with a dependence on unobservable quantities. For example, if D is a distribution over a space X × [0,1], you can state a theorem about the error rate dependent on the variance, E_{(x,y)~D}(y - E_{y'~D|x}[y'])^2. I dislike this, because I want to use the theorems to produce code solving learning problems. Since I don’t know (and can’t measure) the variance, a theorem depending on the variance does not help me—I would not know what variance to plug into the learning algorithm. Recast more broadly, this is a debate between “declarative” and “operative” mathematics. A strong example of “declarative” mathematics is “a new kind of science”. Roughly speaking, the goal of this kind of approach seems to be finding a way to explain the observations we make. Examples include “some things are unpredictable”, “a phase transition exists”, etc… “Operative” mathematics helps you make predictions a

5 0.57884365 41 hunch net-2005-03-15-The State of Tight Bounds


6 0.57351863 170 hunch net-2006-04-06-Bounds greater than 1

7 0.53772902 269 hunch net-2007-10-24-Contextual Bandits

8 0.53502113 115 hunch net-2005-09-26-Prediction Bounds as the Mathematics of Science

9 0.53460902 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

10 0.51123649 85 hunch net-2005-06-28-A COLT paper

11 0.50891417 400 hunch net-2010-06-13-The Good News on Exploration and Learning

12 0.50435853 388 hunch net-2010-01-24-Specializations of the Master Problem

13 0.47381386 163 hunch net-2006-03-12-Online learning or online preservation of learning?

14 0.45706335 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

15 0.44374308 70 hunch net-2005-05-12-Math on the Web

16 0.43460056 368 hunch net-2009-08-26-Another 10-year paper in Machine Learning

17 0.42482251 58 hunch net-2005-04-21-Dynamic Programming Generalizations and Their Use

18 0.40431157 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

19 0.40159011 220 hunch net-2006-11-27-Continuizing Solutions

20 0.39500132 293 hunch net-2008-03-23-Interactive Machine Learning


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(10, 0.047), (27, 0.182), (35, 0.315), (53, 0.042), (55, 0.068), (77, 0.117), (94, 0.076), (95, 0.04)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.91035759 130 hunch net-2005-11-16-MLSS 2006

Introduction: There will be two machine learning summer schools in 2006. One is in Canberra, Australia from February 6 to February 17 (Aussie summer). The webpage is fully ‘live’ so you should actively consider it now. The other is in Taipei, Taiwan from July 24 to August 4. This one is still in the planning phase, but that should be settled soon. Attending an MLSS is probably the quickest and easiest way to bootstrap yourself into a reasonable initial understanding of the field of machine learning.

same-blog 2 0.8704443 392 hunch net-2010-03-26-A Variance only Deviation Bound


3 0.79868758 73 hunch net-2005-05-17-A Short Guide to PhD Graduate Study

Introduction: Graduate study is a mysterious and uncertain process. The easiest way to see this is by noting that a very old advisor/student mechanism is preferred. There is no known successful mechanism for “mass producing” PhDs as is done (in some sense) for undergraduate and masters study. Here are a few hints that might be useful to prospective or current students based on my own experience. Masters or PhD (a) You want a PhD if you want to do research. (b) You want a masters if you want to make money. People wanting (b) will be manifestly unhappy with (a) because it typically means years of low pay. People wanting (a) should try to avoid (b) because it prolongs an already long process. Attitude. Many students struggle for awhile with the wrong attitude towards research. Most students come into graduate school with 16-19 years of schooling where the principal means of success is proving that you know something via assignments, tests, etc… Research does not work this way. Re

4 0.59937918 269 hunch net-2007-10-24-Contextual Bandits


5 0.59731913 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2 . The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.
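The conditioning step P(D|x) = P(x|D)P(D)/P(x) can be made concrete with a tiny discrete example; the two coin-bias hypotheses below are invented for illustration.

```python
def posterior(prior, likelihoods):
    """Bayes' law P(D|x) = P(x|D) P(D) / P(x) over a discrete set of hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(joint)  # P(x), the evidence
    return [j / z for j in joint]

# Two hypotheses for a coin's heads probability, uniform prior, 8 heads / 2 tails.
thetas = [0.5, 0.9]
prior = [0.5, 0.5]
heads, tails = 8, 2
likelihoods = [t ** heads * (1 - t) ** tails for t in thetas]
post = posterior(prior, likelihoods)
print(post)  # posterior mass shifts strongly toward theta = 0.9
```

A hard decision would then pick whichever hypothesis (or action) minimizes expected loss under `post`, exactly the procedure the paragraph describes.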

6 0.58842599 388 hunch net-2010-01-24-Specializations of the Master Problem

7 0.58387589 206 hunch net-2006-09-09-How to solve an NP hard problem in quadratic time

8 0.58269262 317 hunch net-2008-09-12-How do we get weak action dependence for learning with partial observations?

9 0.56555897 375 hunch net-2009-10-26-NIPS workshops

10 0.54871988 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.54433864 436 hunch net-2011-06-22-Ultra LDA

12 0.54257905 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

13 0.54035592 220 hunch net-2006-11-27-Continuizing Solutions

14 0.54029679 258 hunch net-2007-08-12-Exponentiated Gradient

15 0.5370869 237 hunch net-2007-04-02-Contextual Scaling

16 0.53492534 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

17 0.53440869 293 hunch net-2008-03-23-Interactive Machine Learning

18 0.53329366 332 hunch net-2008-12-23-Use of Learning Theory

19 0.53306824 343 hunch net-2009-02-18-Decision by Vetocracy

20 0.5327149 351 hunch net-2009-05-02-Wielding a New Abstraction