hunch_net hunch_net-2006 hunch_net-2006-167 knowledge-graph by maker-knowledge-mining

167 hunch net-2006-03-27-Gradients everywhere


meta info for this blog

Source: html

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient (Simon Osindero). Sometimes we can compute constrained gradients via iterated gradient/project steps (Ben Taskar). Sometimes we can compute gradients anyway over mildly nondifferentiable functions (Drew Bagnell). Even given a gradient, the choice of update is unclear and might be cleverly chosen (Nic Schraudolph). Perhaps a more extreme example of this is Adaboost, which repeatedly reuses a classifier learner to implicitly optimize a gradient. Viewed as a gradient optimization algorithm, Adaboost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees.
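The decoupling described above is just the standard first-order update: each parameter moves against its own partial derivative, so a purely local rule drives the global objective down. A minimal sketch in Python is below; the squared-loss objective, toy data, and step size are illustrative assumptions, not anything from the workshop talks.

```python
import numpy as np

def gradient_step(w, grad, step_size):
    """One local update: every parameter moves against its own partial derivative."""
    return w - step_size * grad

# Hypothetical example: mean squared error on a tiny linear regression problem.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)

for _ in range(200):
    residual = X @ w - y                  # predictions minus targets
    grad = 2.0 * X.T @ residual / len(y)  # gradient of the mean squared error in w
    w = gradient_step(w, grad, step_size=0.1)

print(w)  # converges to the least-squares solution, here approximately [1.0, 2.0]
```

The same one-line update applies no matter how grad was obtained, which is why so much of the cleverness in the talks goes into producing a gradient at all.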


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. [sent-1, score-0.341]

2 For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. [sent-2, score-0.288]

3 The essential useful quality of a gradient is that it decouples local updates from global optimization. [sent-3, score-0.582]

4 Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. [sent-4, score-0.256]

5 It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. [sent-5, score-0.135]

6 Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. [sent-6, score-1.042]

7 Sometimes there are clever approximations of the gradient. [sent-7, score-0.31]

8 (Simon Osindero) Sometimes we can compute constrained gradients via iterated gradient/project steps. [sent-8, score-0.806]

9 (Ben Taskar) Sometimes we can compute gradients anyway over mildly nondifferentiable functions. [sent-9, score-0.81]

10 (Drew Bagnell) Even given a gradient, the choice of update is unclear, and might be cleverly chosen (Nic Schraudolph). Perhaps a more extreme example of this is Adaboost, which repeatedly reuses a classifier learner to implicitly optimize a gradient. [sent-10, score-0.687]

11 Viewed as a gradient optimization algorithm, Adaboost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees. [sent-11, score-0.69]
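Sentence 8 above mentions computing constrained gradients via iterated gradient/project steps. A hedged sketch of that pattern (projected gradient descent) follows; the box constraint, toy objective, and step size are assumptions chosen for illustration, not the constraint sets used in the work the sentence refers to.

```python
import numpy as np

def project_to_box(w, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^d (a stand-in for any feasible set
    with a cheap projection)."""
    return np.clip(w, lo, hi)

def projected_gradient_descent(grad_fn, w0, step_size=0.1, iters=200):
    """Iterated gradient/project steps: take an unconstrained gradient step,
    then project the result back onto the feasible set."""
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = w - step_size * grad_fn(w)  # gradient step
        w = project_to_box(w)           # projection step
    return w

# Hypothetical use: minimize ||w - c||^2 subject to 0 <= w_i <= 1.
c = np.array([1.5, -0.3, 0.4])
w_star = projected_gradient_descent(lambda w: 2.0 * (w - c), np.zeros(3))
print(w_star)  # approximately [1.0, 0.0, 0.4]: c clipped onto the box
```

Each iteration is an ordinary gradient step followed by a Euclidean projection back onto the feasible set, so the constraint handling stays decoupled from how the gradient itself is computed.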


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('gradients', 0.405), ('gradient', 0.321), ('adaboost', 0.236), ('clever', 0.202), ('compute', 0.162), ('parameters', 0.152), ('nic', 0.135), ('atomic', 0.135), ('depressed', 0.135), ('iterated', 0.135), ('osindero', 0.135), ('simon', 0.135), ('sublinear', 0.135), ('taskar', 0.135), ('optimization', 0.13), ('sometimes', 0.127), ('calculus', 0.125), ('mildly', 0.125), ('approximating', 0.118), ('anyways', 0.118), ('ben', 0.118), ('textbook', 0.118), ('breaks', 0.113), ('learner', 0.108), ('bagnell', 0.108), ('approximations', 0.108), ('implicit', 0.104), ('determine', 0.104), ('constrained', 0.104), ('drew', 0.098), ('updates', 0.098), ('implicitly', 0.096), ('restated', 0.093), ('chosen', 0.091), ('happened', 0.089), ('speakers', 0.089), ('repeatedly', 0.087), ('global', 0.086), ('nothing', 0.083), ('word', 0.081), ('viewed', 0.081), ('optimize', 0.08), ('trees', 0.079), ('computing', 0.079), ('unclear', 0.077), ('update', 0.077), ('local', 0.077), ('observations', 0.076), ('given', 0.075), ('extreme', 0.073)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 167 hunch net-2006-03-27-Gradients everywhere

2 0.20667326 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

3 0.13170695 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve

4 0.12463742 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

Introduction: The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although theoretical boosting algorithms had been floating around since 1990 (see the bottom of this page). Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire. I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. Comparing wit

5 0.10908284 197 hunch net-2006-07-17-A Winner

Introduction: Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm . Some characteristics of the algorithm are: Gradient descent … on about 600 parameters … with local minima … to solve regression. This bears a strong resemblance to a neural network. The two main differences seem to be: The system has a probabilistic interpretation (which may aid design). There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).

6 0.086227939 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

7 0.082161844 235 hunch net-2007-03-03-All Models of Learning have Flaws

8 0.080979042 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

9 0.080652043 265 hunch net-2007-10-14-NIPS workshop: Learning Problem Design

10 0.074238271 373 hunch net-2009-10-03-Static vs. Dynamic multiclass prediction

11 0.073549613 237 hunch net-2007-04-02-Contextual Scaling

12 0.070190899 177 hunch net-2006-05-05-An ICML reject

13 0.068590701 124 hunch net-2005-10-19-Workshop: Atomic Learning

14 0.066679776 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

15 0.066406935 267 hunch net-2007-10-17-Online as the new adjective

16 0.064822689 330 hunch net-2008-12-07-A NIPS paper

17 0.062182043 491 hunch net-2013-11-21-Ben Taskar is gone

18 0.061175831 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

19 0.05998899 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.059297882 281 hunch net-2007-12-21-Vowpal Wabbit Code Release


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.135), (1, 0.043), (2, -0.02), (3, -0.029), (4, 0.051), (5, 0.055), (6, -0.071), (7, -0.016), (8, -0.003), (9, 0.057), (10, -0.019), (11, -0.041), (12, -0.001), (13, -0.081), (14, 0.057), (15, 0.06), (16, 0.013), (17, 0.031), (18, -0.022), (19, 0.009), (20, -0.093), (21, -0.024), (22, 0.004), (23, 0.009), (24, -0.045), (25, 0.088), (26, 0.009), (27, -0.002), (28, 0.037), (29, -0.083), (30, -0.122), (31, 0.005), (32, -0.099), (33, -0.119), (34, -0.105), (35, 0.01), (36, -0.167), (37, -0.063), (38, -0.022), (39, 0.055), (40, 0.035), (41, -0.092), (42, 0.053), (43, -0.009), (44, 0.028), (45, 0.086), (46, -0.003), (47, -0.014), (48, 0.024), (49, 0.023)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97799879 167 hunch net-2006-03-27-Gradients everywhere

2 0.86247796 111 hunch net-2005-09-12-Fast Gradient Descent

3 0.76038212 197 hunch net-2006-07-17-A Winner

4 0.65718287 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

5 0.62080705 258 hunch net-2007-08-12-Exponentiated Gradient

6 0.61972779 205 hunch net-2006-09-07-Objective and subjective interpretations of probability

7 0.52128929 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

8 0.45824382 411 hunch net-2010-09-21-Regretting the dead

9 0.45474672 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

10 0.4492068 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

11 0.42941147 186 hunch net-2006-06-24-Online convex optimization at COLT

12 0.42731392 348 hunch net-2009-04-02-Asymmophobia

13 0.41212678 420 hunch net-2010-12-26-NIPS 2010

14 0.39561236 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

15 0.38476825 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS

16 0.38227275 138 hunch net-2005-12-09-Some NIPS papers

17 0.37350672 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

18 0.3716917 177 hunch net-2006-05-05-An ICML reject

19 0.36882713 87 hunch net-2005-06-29-Not EM for clustering at COLT

20 0.36628801 385 hunch net-2009-12-27-Interesting things at NIPS 2009


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.168), (38, 0.036), (53, 0.062), (55, 0.046), (94, 0.066), (98, 0.516)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.88952219 167 hunch net-2006-03-27-Gradients everywhere

2 0.84078526 322 hunch net-2008-10-20-New York’s ML Day

Introduction: I’m not as naturally exuberant as Muthu 2 or David about CS/Econ day, but I believe it and ML day were certainly successful. At the CS/Econ day, I particularly enjoyed Tuomas Sandholm’s talk which showed a commanding depth of understanding and application in automated auctions. For the machine learning day, I enjoyed several talks and posters (I better, I helped pick them.). What stood out to me was the number of people attending: 158 registered, a level qualifying as “scramble to find seats”. My rule of thumb for workshops/conferences is that the number of attendees is often something like the number of submissions. That isn’t the case here, where there were just 4 invited speakers and 30-or-so posters. Presumably, the difference is due to a critical mass of Machine Learning-interested people in the area and the ease of their attendance. Are there other areas where a local Machine Learning day would fly? It’s easy to imagine something working out in the San Franci

3 0.81621683 211 hunch net-2006-10-02-$1M Netflix prediction contest

Introduction: Netflix is running a contest to improve recommender prediction systems. A 10% improvement over their current system yields a $1M prize. Failing that, the best smaller improvement yields a smaller $50K prize. This contest looks quite real, and the $50K prize money is almost certainly achievable with a bit of thought. The contest also comes with a dataset which is apparently 2 orders of magnitude larger than any other public recommendation system dataset.

4 0.76449442 231 hunch net-2007-02-10-Best Practices for Collaboration

Introduction: Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people, can be tricky. Here are some observations of what has worked for me on collaborations involving a few people. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the e

5 0.7497201 111 hunch net-2005-09-12-Fast Gradient Descent

6 0.44265723 379 hunch net-2009-11-23-ICML 2009 Workshops (and Tutorials)

7 0.3544558 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

8 0.35282284 347 hunch net-2009-03-26-Machine Learning is too easy

9 0.35250604 131 hunch net-2005-11-16-The Everything Ensemble Edge

10 0.35182551 95 hunch net-2005-07-14-What Learning Theory might do

11 0.35174295 351 hunch net-2009-05-02-Wielding a New Abstraction

12 0.35141712 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

13 0.35077667 41 hunch net-2005-03-15-The State of Tight Bounds

14 0.35045806 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

15 0.35013384 14 hunch net-2005-02-07-The State of the Reduction

16 0.34973761 258 hunch net-2007-08-12-Exponentiated Gradient

17 0.34970683 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

18 0.34924981 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

19 0.34887493 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

20 0.34819397 370 hunch net-2009-09-18-Necessary and Sufficient Research