hunch_net hunch_net-2007 hunch_net-2007-258 knowledge-graph by maker-knowledge-mining

258 hunch net-2007-08-12-Exponentiated Gradient


meta info for this blog

Source: html

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reveals the truth y, EG updates the weights according to w_i <- w_i e^{-2c(y’ – y)x_i}, and the weights are renormalized to remain a probability distribution.
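To make the round-by-round protocol concrete, here is a minimal sketch in Python. It follows the steps above (features revealed, prediction y’ = w . x, truth revealed, multiplicative update, renormalization); the function names and the toy “world” are hypothetical, and the renormalization step is my reading of the weights staying a probability distribution.

```python
import numpy as np

def eg_round(w, x, y, c):
    """One round of Exponentiated Gradient (sketch).

    w : current weights (a probability distribution over n features)
    x : feature vector in {0,1}^n revealed by the world
    y : true outcome revealed after the prediction
    c : learning rate
    """
    y_pred = np.dot(w, x)                       # EG's prediction y' = w . x
    w = w * np.exp(-2 * c * (y_pred - y) * x)   # exponent of the negative gradient of squared loss
    w = w / w.sum()                             # renormalize so the weights stay a distribution
    return w, y_pred

# Usage: start from the uniform distribution (1/n, ..., 1/n).
n = 5
w = np.ones(n) / n
rng = np.random.default_rng(0)
for t in range(100):
    x = rng.integers(0, 2, size=n)   # the world reveals features in {0,1}^n
    y = float(x[0])                  # hypothetical world: the first feature is the truth
    w, y_pred = eg_round(w, x, y, c=0.5)
```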


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 It’s a fine algorithm which has a remarkable theoretical statement accompanying it. [sent-2, score-0.562]

2 Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). [sent-4, score-0.264]

3 On each round, the following happens: The world reveals a set of features x in {0,1}^n. [sent-6, score-0.402]

4 In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). [sent-7, score-1.042]

5 EG updates the weights according to w_i <- w_i e^{-2c(y’ – y)x_i}. [sent-11, score-0.335]

6 The 4th line is what gives EG its name—exponent of the negative gradient (of squared loss in this case). [sent-14, score-0.488]

7 The accompanying theoretical statement (in English) is that for all sequences of observations, this algorithm does nearly as well in squared loss as the best convex combination of the features, with a regret that is only logarithmic in the number of features. [sent-15, score-1.597]

8 The mathematical theorem statement is: For all T, for all T-length sequences of observations, Sum_{t=1}^{T} (y’_t – y_t)^2 <= min over probability distributions q of (2/(2-c)) Sum_{t=1}^{T} (q · x_t – y_t)^2 + KL(q||p)/c (restated as a display after this list). [sent-16, score-0.417]

9 Here KL(q||p) = Sum_i q_i ln(q_i / p_i) is the KL-divergence between the distribution q that you compare with and the distribution p that you start with. [sent-17, score-0.245]

10 Tong Zhang likes to think of this algorithm as stochastic gradient descent with entropy regularization, which makes it clear that when used as an online optimization algorithm, c should be gradually decreased in value (one formalization appears in the display after this list). [sent-20, score-0.75]

11 Exponentiated Gradient competes with the best convex predictor with no caveats about how the world produces the data. [sent-22, score-0.478]

12 This is pretty remarkable—it’s much stronger than competing with the best single feature, as many other online adversary algorithms do. [sent-23, score-0.402]

13 The notion of competition is logarithmic in the number of features, so the algorithm can give good performance even when the number of features is extraordinarily large compared to the number of examples. [sent-25, score-0.962]

14 In contrast, gradient descent style algorithms might need a number of examples similar to the number of features. [sent-26, score-0.742]

15 (The same paper also analyzes regret for gradient descent. [sent-27, score-0.345]

16 ) We know from a learning reductions point of view that the ability to optimize squared loss well implies the ability to optimize many other losses fairly well. [sent-28, score-0.384]

17 A convex combination is not as powerful a representation as a linear combination. [sent-30, score-0.251]

18 For more general combinations, the story with respect to gradient descent becomes more mixed. [sent-32, score-0.432]

19 When the best predictor is a linear combination of many features, gradient descent style learning may be superior to EG. [sent-33, score-0.82]

20 For example, this paper shows that EG style updates can converge very quickly compared to other methods. [sent-35, score-0.288]
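For readability, here is the regret bound from sentences 8 and 9 restated as a display, together with one standard way to formalize the entropy-regularization view from sentence 10. The simplex notation Delta_n and the mirror-descent formulation are my additions, a sketch of how these statements are usually written rather than a quote from the paper:

```latex
% Regret bound (sentences 8-9): for all T and all T-length observation sequences,
\sum_{t=1}^{T} \big(y'_t - y_t\big)^2
\;\le\;
\min_{q \in \Delta_n}\left[\frac{2}{2-c}\sum_{t=1}^{T}\big(q \cdot x_t - y_t\big)^2
\;+\; \frac{\mathrm{KL}(q\,\|\,p)}{c}\right],
\qquad
\mathrm{KL}(q\,\|\,p) = \sum_i q_i \ln\frac{q_i}{p_i}.

% Entropy-regularized view (sentence 10): the EG update solves a KL-regularized
% linearized problem over the simplex, so decreasing c anneals the step size.
w^{(t+1)} = \operatorname*{argmin}_{w \in \Delta_n}\;
c\,\big\langle \nabla \ell_t\big(w^{(t)}\big),\, w \big\rangle
\;+\; \mathrm{KL}\big(w \,\|\, w^{(t)}\big),
\qquad \ell_t(w) = (w \cdot x_t - y_t)^2.
```

Solving the second display over the simplex gives w_i proportional to w_i e^{-2c(y’_t – y_t)x_{t,i}}, which recovers the update in sentence 5 after renormalization.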


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('eg', 0.405), ('gradient', 0.274), ('features', 0.183), ('sum', 0.173), ('weights', 0.162), ('descent', 0.158), ('statement', 0.152), ('combinations', 0.144), ('adversary', 0.143), ('convex', 0.137), ('accompanying', 0.133), ('exponentiated', 0.133), ('values', 0.132), ('squared', 0.128), ('kl', 0.126), ('experts', 0.124), ('feature', 0.123), ('sequences', 0.115), ('remarkable', 0.115), ('combination', 0.114), ('logarithmic', 0.111), ('reveals', 0.111), ('world', 0.108), ('updates', 0.105), ('style', 0.104), ('number', 0.103), ('best', 0.102), ('algorithm', 0.097), ('online', 0.091), ('distribution', 0.091), ('loss', 0.086), ('optimize', 0.085), ('constant', 0.083), ('observations', 0.081), ('compared', 0.079), ('regret', 0.071), ('theorem', 0.069), ('predictor', 0.068), ('according', 0.068), ('happens', 0.067), ('decreased', 0.067), ('warmuth', 0.067), ('pretty', 0.066), ('theoretical', 0.065), ('english', 0.063), ('relaxations', 0.063), ('likes', 0.063), ('competes', 0.063), ('dot', 0.063), ('ln', 0.063)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 258 hunch net-2007-08-12-Exponentiated Gradient


2 0.25377661 111 hunch net-2005-09-12-Fast Gradient Descent

Introduction: Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD). Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be

3 0.18473278 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

Introduction: “Deep learning” is used to describe learning architectures which have significant depth (as a circuit). One claim is that shallow architectures (one or two layers) can not concisely represent some functions while a circuit with more depth can concisely represent these same functions. Proving lower bounds on the size of a circuit is substantially harder than upper bounds (which are constructive), but some results are known. Luca Trevisan’s class notes detail how XOR is not concisely representable by “AC0” (= constant depth unbounded fan-in AND, OR, NOT gates). This doesn’t quite prove that depth is necessary for the representations commonly used in learning (such as a thresholded weighted sum), but it is strongly suggestive that this is so. Examples like this are a bit disheartening because existing algorithms for deep learning (deep belief nets, gradient descent on deep neural networks, and perhaps decision trees depending on who you ask) can’t learn XOR very easily.

4 0.16335315 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

Introduction: Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y != Threshold(f_w(x) – 0.5)). While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices: Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p’(y|x) where p’(y|x) is the predicted probability of the observed label. A standard method for confi
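As a small, made-up illustration of these loss definitions (not code from the post; the threshold convention "1 if positive, else 0" is my reading of Threshold(f_w(x) – 0.5)):

```python
import math

def squared_loss(y, f):
    # the tractable proxy loss: (y - f)^2
    return (y - f) ** 2

def zero_one_loss(y, f):
    # the loss imposed by the world: I(y != Threshold(f - 0.5))
    return float(y != (1 if f - 0.5 > 0 else 0))

def log_loss(y, p):
    # log 1/p'(y|x) with the prediction p confined to [0,1]
    return -math.log(p if y == 1 else 1.0 - p)

y, f = 1, 0.8  # hypothetical label and prediction
print(squared_loss(y, f), zero_one_loss(y, f), log_loss(y, f))
# -> 0.04, 0.0, 0.223...
```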

5 0.16065449 235 hunch net-2007-03-03-All Models of Learning have Flaws

Introduction: Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning. The point here is not simply “woe unto us”. There are several implications which seem important. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students. Algorithms which conform to multiple approaches c

6 0.15550846 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

7 0.15472381 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

8 0.15329009 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

9 0.1503589 8 hunch net-2005-02-01-NIPS: Online Bayes

10 0.14735511 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

11 0.14480372 274 hunch net-2007-11-28-Computational Consequences of Classification

12 0.13170695 167 hunch net-2006-03-27-Gradients everywhere

13 0.13134663 186 hunch net-2006-06-24-Online convex optimization at COLT

14 0.13042329 332 hunch net-2008-12-23-Use of Learning Theory

15 0.12823333 388 hunch net-2010-01-24-Specializations of the Master Problem

16 0.12671365 267 hunch net-2007-10-17-Online as the new adjective

17 0.12087935 133 hunch net-2005-11-28-A question of quantification

18 0.11919262 245 hunch net-2007-05-12-Loss Function Semantics

19 0.11752553 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

20 0.11671025 279 hunch net-2007-12-19-Cool and interesting things seen at NIPS


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.244), (1, 0.192), (2, 0.053), (3, -0.11), (4, -0.01), (5, 0.023), (6, -0.139), (7, -0.04), (8, -0.031), (9, 0.128), (10, 0.05), (11, -0.027), (12, 0.055), (13, -0.131), (14, 0.101), (15, 0.115), (16, 0.011), (17, -0.036), (18, 0.046), (19, 0.036), (20, -0.066), (21, 0.017), (22, 0.059), (23, -0.028), (24, -0.024), (25, 0.032), (26, -0.048), (27, -0.006), (28, -0.043), (29, -0.036), (30, -0.087), (31, 0.016), (32, -0.126), (33, -0.03), (34, -0.136), (35, 0.018), (36, -0.09), (37, 0.045), (38, 0.061), (39, 0.02), (40, -0.036), (41, 0.002), (42, 0.059), (43, -0.066), (44, -0.064), (45, 0.053), (46, -0.005), (47, 0.017), (48, 0.054), (49, -0.029)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97607332 258 hunch net-2007-08-12-Exponentiated Gradient


2 0.74453592 111 hunch net-2005-09-12-Fast Gradient Descent


3 0.70353675 167 hunch net-2006-03-27-Gradients everywhere

Introduction: One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance. It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down. Sometimes there are clever approximations of the gradient. (Simon Osindero) Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar) Sometimes we can compute gradients anyways over mildly nondifferentiable functions. (Drew Bagnell) Even give

4 0.67828715 308 hunch net-2008-07-06-To Dual or Not

Introduction: Yoram and Shai’s online learning tutorial at ICML brings up a question for me, “Why use the dual?” The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function. The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_{ji}. The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful. Noise When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpfu
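A quick numerical sketch of the primal/dual correspondence described above, with made-up data (the variable names are mine): a dual vector a over the examples induces the same predictor as the primal weights w.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))   # 4 examples, 3 features (row j is example x_j)
a = rng.normal(size=4)        # dual vector: one coefficient per example

w = X.T @ a                   # primal weights: w_i = sum_j a_j x_{ji}

x_new = rng.normal(size=3)
f_primal = w @ x_new          # f(x) = sum_i w_i x_i
f_dual = a @ (X @ x_new)      # same value via inner products with the examples
assert np.isclose(f_primal, f_dual)
```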

5 0.67549551 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

Introduction: The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page). Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire. I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties. Comparing wit

6 0.63418984 197 hunch net-2006-07-17-A Winner

7 0.62735456 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

8 0.60793597 186 hunch net-2006-06-24-Online convex optimization at COLT

9 0.59725869 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

10 0.59683901 133 hunch net-2005-11-28-A question of quantification

11 0.59042782 348 hunch net-2009-04-02-Asymmophobia

12 0.55114561 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

13 0.54384905 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

14 0.52880377 23 hunch net-2005-02-19-Loss Functions for Discriminative Training of Energy-Based Models

15 0.52443308 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

16 0.51907665 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

17 0.5158785 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

18 0.51553285 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

19 0.5151841 253 hunch net-2007-07-06-Idempotent-capable Predictors

20 0.50983828 28 hunch net-2005-02-25-Problem: Online Learning


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(9, 0.032), (10, 0.027), (27, 0.297), (38, 0.037), (53, 0.074), (55, 0.064), (59, 0.188), (77, 0.039), (79, 0.016), (84, 0.014), (94, 0.102), (95, 0.029)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.93497169 449 hunch net-2011-11-26-Giving Thanks

Introduction: Thanksgiving is perhaps my favorite holiday, because pausing your life and giving thanks provides a needed moment of perspective. As a researcher, I am most thankful for my education, without which I could not function. I want to share this, because it provides some sense of how a researcher starts. My long term memory seems to function particularly well, which makes any education I get particularly useful. I am naturally obsessive, which makes me chase down details until I fully understand things. Natural obsessiveness can go wrong, of course, but it’s a great ally when you absolutely must get things right. My childhood was all in one hometown, which was a conscious sacrifice on the part of my father, implying disruptions from moving around were eliminated. I’m not sure how important this was since travel has its own benefits, but it bears thought. I had several great teachers in grade school, and naturally gravitated towards teachers over classmates, as they seemed

same-blog 2 0.90855616 258 hunch net-2007-08-12-Exponentiated Gradient


3 0.85682887 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

Introduction: Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment. There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n^2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vecto

4 0.85402757 347 hunch net-2009-03-26-Machine Learning is too easy

Introduction: One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding the others. There are two fundamental reasons why this is possible. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. There is nothing like a unifying problem defining the field. In many other areas there

5 0.85314852 235 hunch net-2007-03-03-All Models of Learning have Flaws


6 0.85166693 351 hunch net-2009-05-02-Wielding a New Abstraction

7 0.8514601 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

8 0.85100013 360 hunch net-2009-06-15-In Active Learning, the question changes

9 0.84805644 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

10 0.84726912 311 hunch net-2008-07-26-Compositional Machine Learning Algorithm Design

11 0.84556878 95 hunch net-2005-07-14-What Learning Theory might do

12 0.84541267 41 hunch net-2005-03-15-The State of Tight Bounds

13 0.84480482 79 hunch net-2005-06-08-Question: “When is the right time to insert the loss function?”

14 0.84336954 14 hunch net-2005-02-07-The State of the Reduction

15 0.84249687 388 hunch net-2010-01-24-Specializations of the Master Problem

16 0.8423965 359 hunch net-2009-06-03-Functionally defined Nonlinear Dynamic Models

17 0.84197813 259 hunch net-2007-08-19-Choice of Metrics

18 0.84188312 230 hunch net-2007-02-02-Thoughts regarding “Is machine learning different from statistics?”

19 0.84060287 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.84058648 432 hunch net-2011-04-20-The End of the Beginning of Active Learning