
111 hunch net-2005-09-12-Fast Gradient Descent


meta info for this blog

Source: html

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be is unclear.
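To make the step-size problem concrete, here is a minimal sketch (not Schraudolph's SMD, and not code from the post): plain gradient descent on a toy quadratic with a fixed, finite step size. The objective, the starting point, and the value of eta are assumptions chosen only for illustration.

```python
import numpy as np

# Minimal sketch of plain gradient descent with a fixed, finite step size.
# The quadratic objective, the starting point, and eta are illustrative
# assumptions; none of this is taken from the post or from SMD itself.

A = np.diag([1.0, 10.0])      # mismatched curvature makes the step-size choice matter

def grad(w):
    # gradient of f(w) = 0.5 * w^T A w
    return A @ w

w = np.array([1.0, 1.0])      # stands in for "small random values"
eta = 0.05                    # the finite step size that must somehow be chosen

for t in range(100):
    w = w - eta * grad(w)     # one finite step against the gradient

print(0.5 * w @ A @ w)        # too small an eta is slow; too large an eta diverges
```

With eta = 0.05 this converges; pushing eta past 2/10 = 0.2 would make the steeper coordinate diverge, which is the kind of sensitivity that adaptive step-size methods like SMD aim to remove.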


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). [sent-1, score-1.14]

2 Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. [sent-2, score-1.833]

3 Gradient descent does not necessarily produce easily reproduced results. [sent-4, score-0.686]

4 Typical algorithms start with “set the initial parameters to small random values”. [sent-5, score-0.225]

5 The design of the representation that gradient descent is applied to is often nontrivial. [sent-6, score-0.955]

6 In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. [sent-7, score-0.465]

7 Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. [sent-9, score-1.105]

8 What exactly this step size should be is unclear. [sent-10, score-0.57]

9 Many people have developed many algorithms for adjusting the step size (and to some extent the step direction). [sent-11, score-0.911]

10 Unfortunately, many of the more sophisticated algorithms are not robust to noise, scale badly with the number of parameters (Anything worse than O(n) is unacceptable for big applications) or both. [sent-12, score-0.702]

11 Consequently, many people simply use gradient descent where the step size is adjusted by a simple momentum heuristic. [sent-13, score-1.57]

12 Many people would add point (4): gradient descent on many architectures does not result in a global optimum. [sent-14, score-1.429]

13 The goal is good performance on future examples in learning rather than achieving a global optimum on the training set. [sent-16, score-0.333]

14 It is an O(n) algorithm for gradient descent that can compete with the sophisticated methods where they work, while remaining fairly robust to noise. [sent-18, score-1.6]

15 Exactly how well it addresses point (3) is not entirely clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future. [sent-19, score-0.225]
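Sentences 11 and 14 above contrast the common momentum heuristic with SMD's O(n) per-parameter adaptation. The sketch below shows both ideas side by side; it is a simplification, not Schraudolph's actual SMD update (which also propagates curvature via Hessian-vector products), and all hyperparameter values are assumed for illustration.

```python
import numpy as np

# Two O(n) ways to manage step sizes in gradient descent, sketched for illustration.
# The "SMD-like" rule below only captures the spirit of per-parameter gain adaptation;
# the real algorithm additionally uses Hessian-vector products.

def momentum_step(w, v, g, eta=0.01, beta=0.9):
    # simple momentum heuristic: smooth the gradient, take a fixed-size step
    v = beta * v + g
    return w - eta * v, v

def adaptive_gain_step(w, gains, g, g_prev, mu=0.05):
    # per-parameter gains: grow a gain while successive gradients agree in sign,
    # shrink it when they disagree; still O(n) work per update
    gains = gains * np.maximum(0.5, 1.0 + mu * np.sign(g * g_prev))
    return w - gains * g, gains

# toy quadratic objective for a usage example
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

w_m, v = np.ones(2), np.zeros(2)
w_a, gains, g_prev = np.ones(2), np.full(2, 0.01), np.zeros(2)

for t in range(200):
    w_m, v = momentum_step(w_m, v, grad(w_m))
    g = grad(w_a)
    w_a, gains = adaptive_gain_step(w_a, gains, g, g_prev)
    g_prev = g

print(0.5 * w_m @ A @ w_m, 0.5 * w_a @ A @ w_a)
```

The point of the second rule is only that per-parameter adaptation costs O(n) per step, the property the summary credits to SMD; how well such rules hold up under gradient noise is exactly the robustness question raised above.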


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('gradient', 0.491), ('descent', 0.464), ('step', 0.214), ('smd', 0.212), ('size', 0.181), ('exactly', 0.175), ('sophisticated', 0.15), ('architectures', 0.145), ('addresses', 0.141), ('global', 0.119), ('remains', 0.113), ('robust', 0.108), ('parameters', 0.106), ('direction', 0.103), ('neural', 0.098), ('forever', 0.094), ('adjusted', 0.094), ('nic', 0.094), ('reproduced', 0.094), ('unacceptable', 0.094), ('optima', 0.087), ('point', 0.084), ('methods', 0.073), ('easily', 0.073), ('future', 0.071), ('many', 0.068), ('confusion', 0.065), ('compete', 0.065), ('developing', 0.065), ('algorithms', 0.064), ('knowing', 0.063), ('algorithm', 0.063), ('steps', 0.061), ('anything', 0.058), ('stochastic', 0.058), ('badly', 0.058), ('people', 0.058), ('goals', 0.057), ('called', 0.057), ('values', 0.057), ('extent', 0.056), ('developed', 0.056), ('noise', 0.056), ('perform', 0.056), ('consequently', 0.056), ('achieving', 0.056), ('finite', 0.055), ('necessarily', 0.055), ('initial', 0.055), ('worse', 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

2 0.25377661 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n), (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1} n . In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve

3 0.20667326 167 hunch net-2006-03-27-Gradients everywhere

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. ( Simon Osindero ) Sometimes we can compute constrained gradients via iterated gradient/project steps. ( Ben Taskar ) Sometimes we can compute gradients anyways over mildly nondifferentiable functions. ( Drew Bagnell ) Even give

4 0.2052519 197 hunch net-2006-07-17-A Winner

Introduction: Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm . Some characteristics of the algorithm are: Gradient descent … on about 600 parameters … with local minima … to solve regression. This bears a strong resemblance to a neural network. The two main differences seem to be: The system has a probabilistic interpretation (which may aid design). There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).

5 0.19082719 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

Introduction: We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research to build a useful large scale learning algorithm which Lihong Li , Alex Strehl , and I have been working on. To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem dataset fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one which does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are: IBM’s Parallel Machine Learning Toolbox isn’t quite open source . The approach used by this toolbox is essenti

6 0.18106331 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

7 0.17234617 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

8 0.16683687 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

9 0.158875 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

10 0.1503468 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.14265883 186 hunch net-2006-06-24-Online convex optimization at COLT

12 0.1117909 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

13 0.10733838 346 hunch net-2009-03-18-Parallel ML primitives

14 0.10702182 388 hunch net-2010-01-24-Specializations of the Master Problem

15 0.10636838 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

16 0.098253794 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS

17 0.098195516 438 hunch net-2011-07-11-Interesting Neural Network Papers at ICML 2011

18 0.097717106 201 hunch net-2006-08-07-The Call of the Deep

19 0.097623557 267 hunch net-2007-10-17-Online as the new adjective

20 0.095351979 177 hunch net-2006-05-05-An ICML reject


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.187), (1, 0.092), (2, -0.03), (3, -0.005), (4, 0.1), (5, 0.012), (6, -0.165), (7, -0.017), (8, -0.025), (9, 0.143), (10, -0.076), (11, -0.039), (12, 0.002), (13, -0.133), (14, 0.091), (15, 0.144), (16, 0.069), (17, -0.005), (18, -0.042), (19, 0.011), (20, -0.078), (21, 0.031), (22, 0.001), (23, -0.011), (24, -0.084), (25, 0.045), (26, -0.075), (27, 0.028), (28, -0.008), (29, -0.098), (30, -0.116), (31, 0.034), (32, -0.187), (33, -0.127), (34, -0.121), (35, -0.005), (36, -0.213), (37, -0.069), (38, 0.052), (39, 0.046), (40, 0.055), (41, -0.135), (42, 0.047), (43, -0.089), (44, 0.016), (45, 0.087), (46, 0.063), (47, -0.036), (48, 0.05), (49, 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98376119 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

2 0.89777547 167 hunch net-2006-03-27-Gradients everywhere

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. ( Simon Osindero ) Sometimes we can compute constrained gradients via iterated gradient/project steps. ( Ben Taskar ) Sometimes we can compute gradients anyways over mildly nondifferentiable functions. ( Drew Bagnell ) Even give

3 0.84369224 197 hunch net-2006-07-17-A Winner

Introduction: Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm . Some characteristics of the algorithm are: Gradient descent … on about 600 parameters … with local minima … to solve regression. This bears a strong resemblance to a neural network. The two main differences seem to be: The system has a probabilistic interpretation (which may aid design). There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).

4 0.74697232 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

Introduction: The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turn them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page ). Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the annals of statistics , including a response by Yoav Freund and Robert Schapire . I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. Comparing wit

5 0.70832235 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n), (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1} n . In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve

6 0.57303762 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

7 0.54787046 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

8 0.54547822 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

9 0.46782523 186 hunch net-2006-06-24-Online convex optimization at COLT

10 0.46665359 348 hunch net-2009-04-02-Asymmophobia

11 0.45999226 205 hunch net-2006-09-07-Objective and subjective interpretations of probability

12 0.45765463 346 hunch net-2009-03-18-Parallel ML primitives

13 0.4511528 138 hunch net-2005-12-09-Some NIPS papers

14 0.44725269 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

15 0.43222627 441 hunch net-2011-08-15-Vowpal Wabbit 6.0

16 0.43074065 32 hunch net-2005-02-27-Antilearning: When proximity goes bad

17 0.42082062 131 hunch net-2005-11-16-The Everything Ensemble Edge

18 0.42024153 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

19 0.41910881 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

20 0.41545707 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.031), (27, 0.186), (38, 0.044), (53, 0.125), (55, 0.027), (94, 0.063), (95, 0.065), (98, 0.346)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.94313049 167 hunch net-2006-03-27-Gradients everywhere

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. ( Simon Osindero ) Sometimes we can compute constrained gradients via iterated gradient/project steps. ( Ben Taskar ) Sometimes we can compute gradients anyways over mildly nondifferentiable functions. ( Drew Bagnell ) Even give

2 0.91536665 322 hunch net-2008-10-20-New York’s ML Day

Introduction: I’m not as naturally exuberant as Muthu 2 or David about CS/Econ day, but I believe it and ML day were certainly successful. At the CS/Econ day, I particularly enjoyed Toumas Sandholm’s talk which showed a commanding depth of understanding and application in automated auctions. For the machine learning day, I enjoyed several talks and posters (I better, I helped pick them.). What stood out to me was number of people attending: 158 registered, a level qualifying as “scramble to find seats”. My rule of thumb for workshops/conferences is that the number of attendees is often something like the number of submissions. That isn’t the case here, where there were just 4 invited speakers and 30-or-so posters. Presumably, the difference is due to a critical mass of Machine Learning interested people in the area and the ease of their attendance. Are there other areas where a local Machine Learning day would fly? It’s easy to imagine something working out in the San Franci

same-blog 3 0.88778782 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

4 0.87032104 211 hunch net-2006-10-02-$1M Netflix prediction contest

Introduction: Netflix is running a contest to improve recommender prediction systems. A 10% improvement over their current system yields a $1M prize. Failing that, the best smaller improvement yields a smaller $50K prize. This contest looks quite real, and the $50K prize money is almost certainly achievable with a bit of thought. The contest also comes with a dataset which is apparently 2 orders of magnitude larger than any other public recommendation system datasets.

5 0.86902082 231 hunch net-2007-02-10-Best Practices for Collaboration

Introduction: Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people can be tricky. Here are some observations of what has worked for me on collaborations involving a few people. Travel and Discuss Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet. Enable your collaborator . A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think. Avoid Duplication . Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this to be explicit about who has write permission to what. Most of the time, a write lock is held for the e

6 0.56523883 478 hunch net-2013-01-07-NYU Large Scale Machine Learning Class

7 0.55637002 12 hunch net-2005-02-03-Learning Theory, by assumption

8 0.55478483 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

9 0.55391419 379 hunch net-2009-11-23-ICML 2009 Workshops (and Tutorials)

10 0.55354214 370 hunch net-2009-09-18-Necessary and Sufficient Research

11 0.55234993 201 hunch net-2006-08-07-The Call of the Deep

12 0.54719901 131 hunch net-2005-11-16-The Everything Ensemble Edge

13 0.54637498 19 hunch net-2005-02-14-Clever Methods of Overfitting

14 0.54549628 347 hunch net-2009-03-26-Machine Learning is too easy

15 0.54445833 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

16 0.54293364 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

17 0.54282546 41 hunch net-2005-03-15-The State of Tight Bounds

18 0.54151928 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

19 0.53901684 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

20 0.53883517 134 hunch net-2005-12-01-The Webscience Future