hunch_net hunch_net-2005 hunch_net-2005-133 knowledge-graph by maker-knowledge-mining

133 hunch net-2005-11-28-A question of quantification


meta info for this blog

Source: html

Introduction: This is about methods for phrasing and thinking about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same. For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.” For all training sets. This is the standard quantification for boosting analysis such as adaboost or multiclass boosting. Standard theorems have the form “for all training sets the error rate inequalities … hold”. For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”. It is not quite true that each of these is equivalent. F
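The “for all sequences” guarantee from online learning can be made concrete with a minimal sketch of a multiplicative-weights (Hedge-style) master algorithm. The function name, learning rate, and loss encoding below are illustrative assumptions, not anything stated in the post:

```python
import math

def hedge(expert_losses, eta=0.5):
    """Multiplicative-weights (Hedge) aggregation over K experts.

    expert_losses: list of rounds; each round is a list of K losses in [0, 1].
    Returns (algorithm_loss, best_expert_loss). The classical guarantee is
    algorithm_loss <= best_expert_loss + ln(K)/eta + eta*T/8 for ANY loss
    sequence -- no distributional assumption on the examples is needed.
    """
    k = len(expert_losses[0])
    weights = [1.0] * k
    alg_loss = 0.0
    for losses in expert_losses:
        total = sum(weights)
        # The algorithm's expected loss this round under its weight distribution.
        alg_loss += sum(w / total * l for w, l in zip(weights, losses))
        # Exponentially downweight experts that did badly this round.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    best = min(sum(r[i] for r in expert_losses) for i in range(k))
    return alg_loss, best
```

The quantification is visible in the docstring: the regret bound constrains the algorithm on every loss sequence, including adversarially chosen ones, with no distribution in sight.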


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This is about methods for phrasing and thinking about the scope of some theorems in learning theory. [sent-1, score-0.455]

2 The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same. [sent-2, score-0.348]

3 This is the standard quantification in online learning analysis. [sent-4, score-0.485]

4 Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.” [sent-5, score-0.52]

5 This is the standard quantification for boosting analysis such as adaboost or multiclass boosting. [sent-7, score-0.813]

6 Standard theorems have the form “for all training sets the error rate inequalities … hold”. [sent-8, score-0.966]

7 Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”. [sent-11, score-0.919]

8 For example, in the online learning setting, quantifying “for all sequences of examples” implies “for all distributions over examples”, but not vice-versa. [sent-13, score-0.931]
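The one-way implication is just monotonicity of expectation. In the notation below (my own, not the post's), $L_A(S)$ is the algorithm's loss on example sequence $S$, $L^*(S)$ the best comparator's loss, and $R(T)$ the regret term:

```latex
% If a bound holds for every sequence, it holds in expectation under any
% distribution D over sequences:
\forall S:\; L_A(S) \le L^*(S) + R(T)
\;\implies\;
\mathbb{E}_{S \sim D}\!\left[L_A(S)\right]
  \le \mathbb{E}_{S \sim D}\!\left[L^*(S)\right] + R(T).
```

The converse fails because a bound in expectation still permits arbitrarily bad behavior on particular sequences of small probability.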

9 However, in the context of either boosting or reductions these are equivalent because the algorithms operate in an element-wise fashion. [sent-14, score-0.391]

10 To see the equivalence, note that: “For any training set” is equivalent to “For any sequence of examples” because a training set is a sequence and vice versa. [sent-15, score-0.968]

11 If the theorem holds “for all distributions”, it holds for the uniform distribution over the elements in any sequence of examples. [sent-17, score-0.766]
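One direction of the equivalence can be checked mechanically: plugging the uniform distribution over a sequence's elements into a “for all distributions” statement recovers exactly the per-sequence error rate. A toy sketch, where the classifier, data, and helper names are all made up for illustration:

```python
from fractions import Fraction

def error_on_sequence(classifier, sequence):
    """Fraction of mistakes on a fixed sequence of (x, y) examples."""
    return Fraction(sum(classifier(x) != y for x, y in sequence), len(sequence))

def expected_error(classifier, distribution):
    """Expected error under an explicit distribution [(prob, (x, y)), ...]."""
    return sum(p for p, (x, y) in distribution if classifier(x) != y)

def uniform_over(sequence):
    """The uniform distribution over the elements of a sequence."""
    p = Fraction(1, len(sequence))
    return [(p, example) for example in sequence]
```

Any error-rate inequality that holds for all distributions holds in particular for `uniform_over(sequence)`, which by construction equals the sequence's own error rate.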

12 The natural debate here is “how should the theorems be quantified?” [sent-18, score-0.327]

13 It is difficult to answer this debate based upon mathematical grounds because we just showed an equivalence. [sent-19, score-0.224]

14 It is nevertheless important because it strongly influences how we think about algorithms and how easy it is to integrate the knowledge across different theories. [sent-20, score-0.243]

15 Learning theory people (at least) are used to thinking about “For all sequences of examples”. [sent-23, score-0.349]

16 When the algorithm is example-conditional such as in online learning, the quantification is more general than “for all distributions”. [sent-25, score-0.337]

17 For example, a version of the adaboost theorem also applies to test sets using the test error rates of the base classifiers. [sent-29, score-0.695]
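Because the adaboost analysis reweights each example individually, the same bound applies to any fixed sequence: a training set, or equally a test set with the base classifiers' test error rates. A minimal sketch under assumed interfaces (the `weak_learner` signature and the {-1, +1} label encoding are mine, not the post's):

```python
import math

def adaboost(examples, weak_learner, rounds):
    """AdaBoost over a fixed sequence of examples.

    examples: list of (x, y) with y in {-1, +1}.
    weak_learner: function (examples, weights) -> hypothesis h, h(x) in {-1, +1}.
    The classical bound -- holding for EVERY example sequence, with no
    distributional assumption -- is that the error of the combined classifier
    on that sequence is at most prod_t 2*sqrt(eps_t*(1-eps_t)).
    """
    n = len(examples)
    w = [1.0 / n] * n
    ensemble = []  # list of (alpha_t, h_t)
    for _ in range(rounds):
        h = weak_learner(examples, w)
        eps = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard the log below
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Element-wise reweighting: each example's new weight depends only on
        # that example, which is why "for all training sets" and "for all
        # distributions" coincide for this style of analysis.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]
    def combined(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return combined
```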

18 Distributions over examples is simply how most people think about learning problems. [sent-32, score-0.391]

19 “For all distributions over examples” is easily and often confused with “For all distributions over examples accessed by IID draws”. [sent-33, score-1.24]

20 What quantification should be used and why? [sent-35, score-0.265]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('distributions', 0.424), ('examples', 0.329), ('sequences', 0.303), ('quantification', 0.265), ('theorems', 0.217), ('sequence', 0.214), ('sets', 0.192), ('training', 0.171), ('inequalities', 0.17), ('standard', 0.148), ('boosting', 0.134), ('quantifying', 0.132), ('adaboost', 0.132), ('equivalent', 0.132), ('distribution', 0.127), ('debate', 0.11), ('theorem', 0.109), ('holds', 0.107), ('uniform', 0.102), ('scope', 0.1), ('hold', 0.098), ('error', 0.088), ('draws', 0.076), ('misleadingly', 0.076), ('phrasing', 0.076), ('online', 0.072), ('form', 0.071), ('reductions', 0.07), ('transformations', 0.066), ('equivalence', 0.066), ('influences', 0.066), ('vice', 0.066), ('test', 0.063), ('confused', 0.063), ('grounds', 0.063), ('think', 0.062), ('arbitrarily', 0.061), ('different', 0.058), ('rate', 0.057), ('integrate', 0.057), ('operate', 0.055), ('thanks', 0.055), ('confusion', 0.052), ('encounter', 0.051), ('arguments', 0.051), ('sufficiently', 0.051), ('showed', 0.051), ('applies', 0.048), ('approximate', 0.047), ('theory', 0.046)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 133 hunch net-2005-11-28-A question of quantification


2 0.17987184 236 hunch net-2007-03-15-Alternative Machine Learning Reductions Definitions

Introduction: A type of prediction problem is specified by the type of samples produced by a data source (Example: X x {0,1} , X x [0,1] , X x {1,2,3,4,5} , etc…) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc…). For simplicity, we’ll assume that all losses have a minimum of zero. For this post, we can think of a learning reduction as A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification). A mapping Q from predictors for type T’ to predictors for type T . The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form: Theorem For all base predictors b , for all distributions D over examples of type T : E (x,y) ~ D L T (y,Q(b,x)) <= f(E (x’,y’)~R(D) L T’ (y’,b(x’))) Here L T is the loss for the type T problem and L T’ is the loss for the type T’ problem. Also, R(D) is the distribution ov

3 0.17237541 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be The Problem The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

4 0.16874725 104 hunch net-2005-08-22-Do you believe in induction?

Introduction: Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem” which seems worth summarizing. As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof of this is very simple: on every example a predictor must make some prediction and by symmetry over the set of sequences it will be wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings. The simplistic interpretation of this theorem which many people jump to is “machine learning is dead” since there can be no single learning algorithm which can solve all learning problems. This is the wrong way to thi

5 0.15412685 14 hunch net-2005-02-07-The State of the Reduction

Introduction: What? Reductions are machines which turn solvers for one problem into solvers for another problem. Why? Reductions are useful for several reasons. Laziness . Reducing a problem to classification make at least 10 learning algorithms available to solve a problem. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work. Crystallization . The problems we often want to solve in learning are worst-case-impossible, but average case feasible. By reducing all problems onto one or a few primitives, we can fine tune these primitives to perform well on real-world problems with greater precision due to the greater number of problems to validate on. Theoretical Organization . By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder. What we know now . Typesafe r

6 0.12461719 8 hunch net-2005-02-01-NIPS: Online Bayes

7 0.12087935 258 hunch net-2007-08-12-Exponentiated Gradient

8 0.1168811 19 hunch net-2005-02-14-Clever Methods of Overfitting

9 0.11647186 12 hunch net-2005-02-03-Learning Theory, by assumption

10 0.11609358 34 hunch net-2005-03-02-Prior, “Prior” and Bias

11 0.11311492 245 hunch net-2007-05-12-Loss Function Semantics

12 0.1123533 251 hunch net-2007-06-24-Interesting Papers at ICML 2007

13 0.10801393 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

14 0.10522195 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

15 0.10059495 252 hunch net-2007-07-01-Watchword: Online Learning

16 0.099206686 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

17 0.097127683 332 hunch net-2008-12-23-Use of Learning Theory

18 0.096267529 183 hunch net-2006-06-14-Explorations of Exploration

19 0.096264519 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

20 0.090207763 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.192), (1, 0.157), (2, 0.067), (3, -0.055), (4, 0.025), (5, -0.04), (6, 0.038), (7, -0.003), (8, 0.01), (9, 0.009), (10, 0.025), (11, 0.029), (12, 0.071), (13, -0.087), (14, 0.063), (15, 0.01), (16, 0.028), (17, -0.042), (18, 0.003), (19, 0.0), (20, -0.02), (21, -0.072), (22, 0.013), (23, -0.113), (24, 0.033), (25, -0.018), (26, 0.008), (27, -0.028), (28, -0.025), (29, -0.001), (30, -0.02), (31, 0.05), (32, -0.063), (33, -0.057), (34, 0.001), (35, -0.028), (36, 0.013), (37, 0.064), (38, 0.026), (39, -0.036), (40, 0.0), (41, -0.042), (42, -0.012), (43, 0.006), (44, -0.009), (45, -0.066), (46, 0.007), (47, 0.098), (48, 0.045), (49, -0.033)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96521741 133 hunch net-2005-11-28-A question of quantification


2 0.71865749 104 hunch net-2005-08-22-Do you believe in induction?

Introduction: Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem” which seems worth summarizing. As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof of this is very simple: on every example a predictor must make some prediction and by symmetry over the set of sequences it will be wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings. The simplistic interpretation of this theorem which many people jump to is “machine learning is dead” since there can be no single learning algorithm which can solve all learning problems. This is the wrong way to thi

3 0.6751821 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

Introduction: Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made. Online learning (as meant here), is a subfield of learning theory which analyzes the online learning model. In the online learning model, there are a set of hypotheses or “experts”. On any instantance x , each expert makes a prediction y . A master algorithm A uses these predictions to form it’s own prediction y A and then learns the correct prediction y * . This process repeats. The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove: L(A) les

4 0.66232008 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

Introduction: I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be The Problem The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions. In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

5 0.65180767 8 hunch net-2005-02-01-NIPS: Online Bayes

Introduction: One nice use for this blog is to consider and discuss papers that that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms . From the paper: The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting). It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem: Let Q be any distribution over parameters theta. Then for all sequences S: L_{Bayes}(S) leq L_Q(S)

6 0.65101337 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

7 0.64887494 258 hunch net-2007-08-12-Exponentiated Gradient

8 0.64858848 43 hunch net-2005-03-18-Binomial Weighting

9 0.64825487 163 hunch net-2006-03-12-Online learning or online preservation of learning?

10 0.62544793 12 hunch net-2005-02-03-Learning Theory, by assumption

11 0.61517352 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

12 0.61220622 160 hunch net-2006-03-02-Why do people count for learning?

13 0.60836107 7 hunch net-2005-01-31-Watchword: Assumption

14 0.60756624 56 hunch net-2005-04-14-Families of Learning Theory Statements

15 0.59981263 28 hunch net-2005-02-25-Problem: Online Learning

16 0.58692056 57 hunch net-2005-04-16-Which Assumptions are Reasonable?

17 0.5865351 34 hunch net-2005-03-02-Prior, “Prior” and Bias

18 0.58488995 347 hunch net-2009-03-26-Machine Learning is too easy

19 0.56557345 14 hunch net-2005-02-07-The State of the Reduction

20 0.56105804 177 hunch net-2006-05-05-An ICML reject


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(0, 0.239), (3, 0.029), (27, 0.363), (38, 0.106), (53, 0.045), (55, 0.057), (94, 0.056)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97926432 74 hunch net-2005-05-21-What is the right form of modularity in structured prediction?

Introduction: Suppose you are given a sequence of observations x 1 ,…,x T from some space and wish to predict a sequence of labels y 1 ,…,y T so as to minimize the Hamming loss: sum i=1 to T I(y i != c(x 1 ,…,x T ) i ) where c(x 1 ,…,x T ) i is the i th predicted component. For simplicity, suppose each label y i is in {0,1} . We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y i independently since the loss is a linear combination of losses on each individual component i . From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te . This breakup into T different prediction problems is not the standard form of modularity in structured prediction. A more typical form of modularity is to predict y i given x i , y i-1 , y i+1 where the circularity (predicting given other

2 0.96036381 473 hunch net-2012-09-29-Vowpal Wabbit, version 7.0

Introduction: A new version of VW is out . The primary changes are: Learning Reductions : I’ve wanted to get learning reductions working and we’ve finally done it. Not everything is implemented yet, but VW now supports direct: Multiclass Classification –oaa or –ect . Cost Sensitive Multiclass Classification –csoaa or –wap . Contextual Bandit Classification –cb . Sequential Structured Prediction –searn or –dagger In addition, it is now easy to build your own custom learning reductions for various plausible uses: feature diddling, custom structured prediction problems, or alternate learning reductions. This effort is far from done, but it is now in a generally useful state. Note that all learning reductions inherit the ability to do cluster parallel learning. Library interface : VW now has a basic library interface. The library provides most of the functionality of VW, with the limitation that it is monolithic and nonreentrant. These will be improved over

same-blog 3 0.94824797 133 hunch net-2005-11-28-A question of quantification


4 0.9360587 215 hunch net-2006-10-22-Exemplar programming

Introduction: There are many different abstractions for problem definition and solution. Here are a few examples: Functional programming: a set of functions are defined. The composed execution of these functions yields the solution. Linear programming: a set of constraints and a linear objective function are defined. An LP solver finds the constrained optimum. Quadratic programming: Like linear programming, but the language is a little more flexible (and the solution slower). Convex programming: like quadratic programming, but the language is more flexible (and the solutions even slower). Dynamic programming: a recursive definition of the problem is defined and then solved efficiently via caching tricks. SAT programming: A problem is specified as a satisfiability involving a conjunction of a disjunction of boolean variables. A general engine attempts to find a good satisfying assignment. For example Kautz’s blackbox planner. These abstractions have different tradeoffs betw

5 0.92673224 62 hunch net-2005-04-26-To calibrate or not?

Introduction: A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p , the proportion of the time that 1 is observed is p . Since there are infinitely many p , this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p . A great deal of effort has been devoted to strategies for achieving calibrated (such as here ) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions. Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way: We want good probabilistic predictions. Good probabilistic predictions are calibrated. Therefore, we want calibrated predictions. The “Therefore” step misses the fact that calibration is a necessary b

6 0.9159286 87 hunch net-2005-06-29-Not EM for clustering at COLT

7 0.90174466 193 hunch net-2006-07-09-The Stock Prediction Machine Learning Problem

8 0.86363828 14 hunch net-2005-02-07-The State of the Reduction

9 0.85247159 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

10 0.84292698 492 hunch net-2013-12-01-NIPS tutorials and Vowpal Wabbit 7.4

11 0.8403101 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

12 0.83927256 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

13 0.83674926 131 hunch net-2005-11-16-The Everything Ensemble Edge

14 0.83495826 72 hunch net-2005-05-16-Regret minimizing vs error limiting reductions

15 0.83452201 352 hunch net-2009-05-06-Machine Learning to AI

16 0.83363545 245 hunch net-2007-05-12-Loss Function Semantics

17 0.83157682 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

18 0.82941854 258 hunch net-2007-08-12-Exponentiated Gradient

19 0.82816213 9 hunch net-2005-02-01-Watchword: Loss

20 0.82813478 259 hunch net-2007-08-19-Choice of Metrics