hunch_net hunch_net-2006 hunch_net-2006-179 knowledge-graph by maker-knowledge-mining

179 hunch net-2006-05-16-The value of the orthodox view of Boosting


meta info for this blog

Source: html

Introduction: The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may only be able to predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although theoretical boosting algorithms had been floating around since 1990 (see the bottom of this page). Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire. I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. Comparing wit
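
As a concrete illustration of the weak-to-strong view described in the introduction (this example is not from the original post), here is a minimal Adaboost sketch in Python. It assumes a depth-one decision stump as the weak learner; the helper names fit_stump and stump_predict are hypothetical, chosen only for readability.

import numpy as np

# Weak learner used as a subroutine: the single feature/threshold/sign
# decision stump with the smallest weighted training error.
def fit_stump(X, y, w):
    n, d = X.shape
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

# Adaboost: boost the weak learner into a strong learner by reweighting
# the examples each round and keeping a weighted vote of the stumps.
def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # example weights
    ensemble = []                            # list of (alpha, stump)
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        w *= np.exp(-alpha * y * pred)       # up-weight the mistakes
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)

On a toy problem with labels in {-1,+1}, a handful of rounds of stumps already predicts far better than any single stump, which is the weak-to-strong behavior the term refers to; note that the weak learner is invoked purely as a subroutine, which is the source of the automatic computational speedups mentioned in the summary below.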


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may only be able to predict slightly better than random) and turns them into strongly capable learners (which predict very well). [sent-1, score-0.52]

2 Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although theoretical boosting algorithms had been floating around since 1990 (see the bottom of this page). [sent-2, score-0.908]

3 Since then, many different interpretations of why boosting works have arisen. [sent-3, score-0.413]

4 I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). [sent-5, score-1.353]

5 This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. [sent-6, score-0.22]

6 Comparing with all other views of boosting is too clumsy, so I will pick one: “boosting is coordinate-wise gradient descent (CWGD for short) on an exponential loss function”, which started here, and compare it with Adaboost (a worked version of this comparison appears after this list). [sent-7, score-0.874]

7 There are two advantages of the “weak to strong learning” view: Automatic computational speedups. [sent-8, score-0.212]

8 In the “weak to strong learning” view, you automatically think about using a learning algorithm as a subroutine. [sent-9, score-0.286]

9 Using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. [sent-12, score-0.33]

10 The straightforward thing to do is simply check each coordinate in turn, which yields no computational speedups. [sent-13, score-0.256]

11 This is unsurprising—simply consider the limit where only one round of boosting is done. [sent-16, score-0.411]

12 The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. [sent-18, score-1.008]

13 In particular, the CWGD view suggests that a broader family of algorithms may be useful than the weak-to-strong view might suggest. [sent-19, score-1.07]

14 Some methods are harder to use than others, so it is useful to think about what works well. [sent-22, score-0.215]

15 Gradient descent is a core algorithmic tool in machine learning. [sent-23, score-0.195]

16 After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. [sent-24, score-0.384]

17 Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. [sent-25, score-0.937]

18 My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. [sent-26, score-1.141]

19 There are many gradient descent algorithms which are not very useful in machine learning. [sent-27, score-0.479]

20 Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. [sent-28, score-0.545]
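
To make the CWGD comparison above concrete, the following is the standard derivation (stated here for reference, not quoted from the post; the symbols F, alpha_t, h_t, and epsilon_t are introduced for this sketch) showing that an exact coordinate-wise step on the exponential loss reproduces the Adaboost coefficient.

% Exponential loss of the boosted predictor F(x) = \sum_t \alpha_t h_t(x),
% with labels y_i \in \{-1,+1\} and weak hypotheses h_t(x) \in \{-1,+1\}:
L(F) = \sum_{i=1}^{m} \exp\bigl(-y_i F(x_i)\bigr)

% Holding every coordinate except \alpha_t fixed, and writing w_i for the
% normalized weights \exp(-y_i F_{t-1}(x_i)) / Z_t from the earlier rounds:
L(\alpha_t) \propto \sum_i w_i \, e^{-\alpha_t y_i h_t(x_i)}
            = (1-\varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t},
\qquad \varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i

% Setting dL/d\alpha_t = 0 yields the familiar Adaboost coefficient:
-(1-\varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t} = 0
\quad\Longrightarrow\quad
\alpha_t = \tfrac{1}{2} \ln \frac{1-\varepsilon_t}{\varepsilon_t}

This is the sense in which Adaboost can be derived as (coordinate-wise) gradient descent on an exponential loss, as in sentence 16; the weak-to-strong view instead treats the choice of h_t on each round as a call to a weak learning subroutine, which is where its computational and conceptual advantages come from.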


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('cwgd', 0.456), ('view', 0.397), ('boosting', 0.36), ('weak', 0.256), ('adaboost', 0.177), ('gradient', 0.161), ('strong', 0.16), ('descent', 0.148), ('views', 0.125), ('unmotivated', 0.125), ('coordinate', 0.125), ('useful', 0.109), ('learner', 0.108), ('learners', 0.108), ('derive', 0.098), ('significantly', 0.096), ('pick', 0.08), ('turn', 0.079), ('viewpoint', 0.074), ('value', 0.072), ('old', 0.071), ('meta', 0.068), ('bottom', 0.068), ('using', 0.066), ('achieve', 0.063), ('coincidence', 0.063), ('algorithms', 0.061), ('algorithm', 0.06), ('clumsy', 0.059), ('floating', 0.059), ('deriving', 0.059), ('barely', 0.059), ('unsurprising', 0.056), ('family', 0.054), ('nontrivial', 0.054), ('yoav', 0.054), ('discussion', 0.054), ('performance', 0.053), ('methods', 0.053), ('works', 0.053), ('broader', 0.052), ('informed', 0.052), ('computational', 0.052), ('consequence', 0.051), ('freund', 0.051), ('round', 0.051), ('others', 0.05), ('predict', 0.05), ('comparing', 0.049), ('tool', 0.047)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 179 hunch net-2006-05-16-The value of the orthodox view of Boosting


2 0.18106331 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

3 0.16290577 466 hunch net-2012-06-05-ICML acceptance statistics

Introduction: People are naturally interested in slicing the ICML acceptance statistics in various ways. Here’s a rundown for the top categories. 18/66 = 0.27 in (0.18,0.36) Reinforcement Learning 10/52 = 0.19 in (0.17,0.37) Supervised Learning 9/51 = 0.18 not in (0.18, 0.37) Clustering 12/46 = 0.26 in (0.17, 0.37) Kernel Methods 11/40 = 0.28 in (0.15, 0.4) Optimization Algorithms 8/33 = 0.24 in (0.15, 0.39) Learning Theory 14/33 = 0.42 not in (0.15, 0.39) Graphical Models 10/32 = 0.31 in (0.15, 0.41) Applications (+5 invited) 8/29 = 0.28 in (0.14, 0.41) Probabilistic Models 13/29 = 0.45 not in (0.14, 0.41) NN & Deep Learning 8/26 = 0.31 in (0.12, 0.42) Transfer and Multi-Task Learning 13/25 = 0.52 not in (0.12, 0.44) Online Learning 5/25 = 0.20 in (0.12, 0.44) Active Learning 6/22 = 0.27 in (0.14, 0.41) Semi-Superv

4 0.13320711 9 hunch net-2005-02-01-Watchword: Loss

Introduction: A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss. There seems to be a strong dichotomy between two views of what “loss” means in learning. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l_2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SV

5 0.12463742 167 hunch net-2006-03-27-Gradients everywhere

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. ( Simon Osindero ) Sometimes we can compute constrained gradients via iterated gradient/project steps. ( Ben Taskar ) Sometimes we can compute gradients anyways over mildly nondifferentiable functions. ( Drew Bagnell ) Even give

6 0.12056065 235 hunch net-2007-03-03-All Models of Learning have Flaws

7 0.11759502 16 hunch net-2005-02-09-Intuitions from applied learning

8 0.11752553 258 hunch net-2007-08-12-Exponentiated Gradient

9 0.11631461 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS

10 0.11037062 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

11 0.10801393 133 hunch net-2005-11-28-A question of quantification

12 0.091204517 345 hunch net-2009-03-08-Prediction Science

13 0.087559849 332 hunch net-2008-12-23-Use of Learning Theory

14 0.085690573 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

15 0.085459217 12 hunch net-2005-02-03-Learning Theory, by assumption

16 0.082416862 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

17 0.080832295 308 hunch net-2008-07-06-To Dual or Not

18 0.08063747 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

19 0.079933673 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

20 0.079733729 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.188), (1, 0.069), (2, -0.005), (3, -0.007), (4, 0.013), (5, 0.016), (6, -0.106), (7, -0.006), (8, -0.006), (9, 0.038), (10, -0.01), (11, -0.039), (12, -0.038), (13, -0.084), (14, 0.052), (15, 0.095), (16, 0.04), (17, -0.033), (18, -0.01), (19, -0.005), (20, 0.003), (21, -0.042), (22, -0.006), (23, -0.032), (24, -0.071), (25, -0.036), (26, -0.041), (27, 0.045), (28, 0.036), (29, -0.053), (30, -0.094), (31, -0.033), (32, -0.075), (33, -0.128), (34, -0.112), (35, -0.034), (36, -0.109), (37, -0.078), (38, 0.059), (39, 0.111), (40, 0.001), (41, -0.009), (42, 0.031), (43, -0.022), (44, -0.044), (45, -0.034), (46, 0.072), (47, 0.062), (48, 0.032), (49, 0.104)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95118773 179 hunch net-2006-05-16-The value of the orthodox view of Boosting


2 0.79096568 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

3 0.75198615 197 hunch net-2006-07-17-A Winner

Introduction: Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm . Some characteristics of the algorithm are: Gradient descent … on about 600 parameters … with local minima … to solve regression. This bears a strong resemblance to a neural network. The two main differences seem to be: The system has a probabilistic interpretation (which may aid design). There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).

4 0.72335875 167 hunch net-2006-03-27-Gradients everywhere

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. ( Simon Osindero ) Sometimes we can compute constrained gradients via iterated gradient/project steps. ( Ben Taskar ) Sometimes we can compute gradients anyways over mildly nondifferentiable functions. ( Drew Bagnell ) Even give

5 0.65906477 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve

6 0.55782151 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

7 0.54967207 348 hunch net-2009-04-02-Asymmophobia

8 0.54669017 345 hunch net-2009-03-08-Prediction Science

9 0.54511797 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

10 0.52307302 205 hunch net-2006-09-07-Objective and subjective interpretations of probability

11 0.51589489 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

12 0.4939504 126 hunch net-2005-10-26-Fallback Analysis is a Secret to Useful Algorithms

13 0.48432925 177 hunch net-2006-05-05-An ICML reject

14 0.47374222 235 hunch net-2007-03-03-All Models of Learning have Flaws

15 0.4659147 466 hunch net-2012-06-05-ICML acceptance statistics

16 0.46231499 253 hunch net-2007-07-06-Idempotent-capable Predictors

17 0.46178618 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

18 0.45708823 420 hunch net-2010-12-26-NIPS 2010

19 0.45044607 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

20 0.44877139 308 hunch net-2008-07-06-To Dual or Not


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.014), (27, 0.224), (38, 0.031), (51, 0.326), (53, 0.088), (55, 0.119), (94, 0.044), (95, 0.051)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.95048606 324 hunch net-2008-11-09-A Healthy COLT

Introduction: A while ago , we discussed the health of COLT . COLT 2008 substantially addressed my concerns. The papers were diverse and several were interesting. Attendance was up, which is particularly notable in Europe. In my opinion, the colocation with UAI and ICML was the best colocation since 1998. And, perhaps best of all, registration ended up being free for all students due to various grants from the Academy of Finland , Google , IBM , and Yahoo . A basic question is: what went right? There seem to be several answers. Cost-wise, COLT had sufficient grants to alleviate the high cost of the Euro and location at a university substantially reduces the cost compared to a hotel. Organization-wise, the Finns were great with hordes of volunteers helping set everything up. Having too many volunteers is a good failure mode. Organization-wise, it was clear that all 3 program chairs were cooperating in designing the program. Facilities-wise, proximity in time and space made

2 0.92789304 489 hunch net-2013-09-20-No NY ML Symposium in 2013, and some good news

Introduction: There will be no New York ML Symposium this year. The core issue is that NYAS is disorganized by people leaving, pushing back the date, with the current candidate a spring symposium on March 28. Gunnar and I were outvoted here—we were gung ho on organizing a fall symposium, but the rest of the committee wants to wait. In some good news, most of the ICML 2012 videos have been restored from a deep backup.

3 0.89157408 334 hunch net-2009-01-07-Interesting Papers at SODA 2009

Introduction: Several talks seem potentially interesting to ML folks at this year’s SODA. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta, Approximate Clustering without the Approximation. This paper gives reasonable algorithms with provable approximation guarantees for k-median and other notions of clustering. It’s conceptually interesting, because it’s the second example I’ve seen where NP hardness is subverted by changing the problem definition in a subtle but reasonable way. Essentially, they show that if any near-approximation to an optimal solution is good, then it’s computationally easy to find a near-optimal solution. This subtle shift bears serious thought. A similar one occurred in our ranking paper with respect to minimum feedback arcset. With two known examples, it suggests that many more NP-complete problems might be finessed into irrelevance in this style. Yury Lifshits and Shengyu Zhang, Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates, and Smal

4 0.882321 393 hunch net-2010-04-14-MLcomp: a website for objectively comparing ML algorithms

Introduction: Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We ( Percy and Jake ) believe there are currently a number of shortcomings: Incomplete Disclosure: You read a paper that proposes Algorithm A which is shown to outperform SVMs on two datasets.  Great.  But what about on other datasets?  How sensitive is this result?   What about compute time – does the algorithm take two seconds on a laptop or two weeks on a 100-node cluster? Lack of Standardization: Algorithm A beats Algorithm B on one version of a dataset.  Algorithm B beats Algorithm A on another version yet uses slightly different preprocessing.  Though doing a head-on comparison would be ideal, it would be tedious since the programs probably use different dataset formats and have a large array of options

same-blog 5 0.86663932 179 hunch net-2006-05-16-The value of the orthodox view of Boosting


6 0.80690771 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

7 0.71894366 235 hunch net-2007-03-03-All Models of Learning have Flaws

8 0.67266345 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

9 0.66320711 416 hunch net-2010-10-29-To Vidoelecture or not

10 0.66156918 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

11 0.66082156 309 hunch net-2008-07-10-Interesting papers, ICML 2008

12 0.65387613 403 hunch net-2010-07-18-ICML & COLT 2010

13 0.65114838 406 hunch net-2010-08-22-KDD 2010

14 0.65036923 132 hunch net-2005-11-26-The Design of an Optimal Research Environment

15 0.6451031 204 hunch net-2006-08-28-Learning Theory standards for NIPS 2006

16 0.63931042 225 hunch net-2007-01-02-Retrospective

17 0.63714105 256 hunch net-2007-07-20-Motivation should be the Responsibility of the Reviewer

18 0.63548428 475 hunch net-2012-10-26-ML Symposium and Strata-Hadoop World

19 0.63333338 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.63156873 258 hunch net-2007-08-12-Exponentiated Gradient