hunch_net hunch_net-2008 hunch_net-2008-308 knowledge-graph by maker-knowledge-mining

308 hunch net-2008-07-06-To Dual or Not


meta info for this blog

Source: html

Introduction: Yoram and Shai’s online learning tutorial at ICML brings up a question for me, “Why use the dual?” The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function. The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_ji. The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful. Noise: When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels.
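This functional view is exactly what kernelized algorithms exploit: because w_i = sum_j a_j x_ji, predictions only need inner products between examples, so the weight vector never has to be formed explicitly. As an illustration (my own sketch, not code from the tutorial or the post), here is a minimal dual-form perceptron in Python; the labels are carried alongside the coefficients a_j, and replacing linear_kernel with any other kernel gives the kernelized version.

import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))

def dual_perceptron(X, y, kernel=linear_kernel, epochs=5):
    """Perceptron kept entirely in the dual: one coefficient a_j per example."""
    n = len(X)
    a = np.zeros(n)
    for _ in range(epochs):
        for t in range(n):
            # f(x_t) = sum_j a_j y_j K(x_j, x_t); the weight vector is only implicit
            score = sum(a[j] * y[j] * kernel(X[j], X[t]) for j in range(n))
            if y[t] * score <= 0:  # mistake: bump this example's dual coefficient
                a[t] += 1.0
    return a

def predict(a, X, y, x_new, kernel=linear_kernel):
    return np.sign(sum(a[j] * y[j] * kernel(X[j], x_new) for j in range(len(X))))

For the linear kernel the primal weights can always be recovered as w = sum_j a_j y_j x_j, matching the identity in the post (with the labels absorbed into the coefficients).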


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Yoram and Shai’s online learning tutorial at ICML brings up a question for me, “Why use the dual?” [sent-1, score-0.959]

2 The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function. [sent-2, score-0.846]

3 The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_ji. [sent-3, score-1.897]

4 The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. [sent-4, score-1.47]

5 The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. [sent-5, score-1.344]

6 I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful. [sent-6, score-0.876]

7 When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels. [sent-9, score-1.274]

8 Rates: One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky. [sent-10, score-0.356]

9 Kernelization: Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of a_i, which is handy when the weight vector is only implicitly defined. [sent-11, score-1.323]

10 A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. [sent-12, score-0.96]

11 There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems. [sent-13, score-0.365]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('dual', 0.62), ('vector', 0.288), ('weight', 0.27), ('nonconvex', 0.212), ('representation', 0.164), ('online', 0.135), ('sum', 0.128), ('track', 0.122), ('tutorial', 0.122), ('view', 0.114), ('duality', 0.106), ('annoyance', 0.106), ('font', 0.098), ('primal', 0.098), ('yoram', 0.098), ('space', 0.094), ('optimizes', 0.093), ('deriving', 0.093), ('unified', 0.093), ('unavoidable', 0.093), ('examples', 0.092), ('unnecessary', 0.089), ('instead', 0.083), ('brings', 0.082), ('shai', 0.079), ('handy', 0.075), ('implicitly', 0.075), ('functional', 0.075), ('noisy', 0.075), ('operating', 0.072), ('keeping', 0.072), ('convex', 0.067), ('drawback', 0.067), ('dealing', 0.067), ('algorithmic', 0.066), ('allowed', 0.065), ('rates', 0.064), ('noise', 0.064), ('define', 0.063), ('keep', 0.063), ('left', 0.062), ('computing', 0.062), ('gives', 0.061), ('apply', 0.061), ('argument', 0.06), ('descent', 0.058), ('explicitly', 0.057), ('individual', 0.056), ('suggests', 0.056), ('addition', 0.055)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 308 hunch net-2008-07-06-To Dual or Not


2 0.16051762 185 hunch net-2006-06-16-Regularization = Robustness

Introduction: The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math, argmax_p H(p) s.t. E_p[f_i] = c_i is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constraint, and the lambdas are free parameters we set to meet the expectation constraints). A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here. In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family
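A quick worked derivation (my own sketch in LaTeX, not from the quoted post) of why the entropy maximizer under expectation constraints takes the exponential family form e^{\sum \lambda_i f_i}/Z:

\max_p \; H(p) = -\sum_x p(x)\log p(x)
\quad \text{s.t.} \quad \sum_x p(x) f_i(x) = c_i, \;\; \sum_x p(x) = 1

\mathcal{L} = -\sum_x p(x)\log p(x) + \sum_i \lambda_i \Big(\sum_x p(x) f_i(x) - c_i\Big) + \mu \Big(\sum_x p(x) - 1\Big)

\frac{\partial \mathcal{L}}{\partial p(x)} = -\log p(x) - 1 + \sum_i \lambda_i f_i(x) + \mu = 0
\;\;\Rightarrow\;\;
p(x) = \frac{e^{\sum_i \lambda_i f_i(x)}}{Z}, \qquad Z = \sum_x e^{\sum_i \lambda_i f_i(x)}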

3 0.10934176 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

Introduction: Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds. To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values). On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted. A vector window and a matrix both imply a notion of locality which is being actively and

4 0.10833779 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve
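For concreteness, a minimal sketch (my own, assuming square loss on the prediction y’ = w . x; not necessarily the exact loss used in the post) of the multiplicative EG update in Python:

import numpy as np

def eg_update(w, x, y, eta=0.1):
    """One round of an Exponentiated Gradient style update.
    w: current weights (a probability vector), x: expert/feature values, y: revealed outcome.
    Square loss (y' - y)^2 is assumed here."""
    y_hat = np.dot(w, x)              # prediction y' = w . x
    grad = 2.0 * (y_hat - y) * x      # gradient of the loss with respect to w
    w_new = w * np.exp(-eta * grad)   # multiplicative (exponentiated) step
    return w_new / w_new.sum()        # renormalize so the weights stay a distribution

# usage: start from uniform weights (1/n, ..., 1/n) as in the excerpt above
n = 4
w = np.ones(n) / n
w = eg_update(w, x=np.array([1.0, 0.0, 1.0, 1.0]), y=1.0)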

5 0.10292365 267 hunch net-2007-10-17-Online as the new adjective

Introduction: Online learning is in vogue, which means we should expect to see in the near future: Online boosting. Online decision trees. Online SVMs. (actually, we’ve already seen) Online deep learning. Online parallel learning. etc… There are three fundamental drivers of this trend. Increasing size of datasets makes online algorithms attractive. Online learning can simply be more efficient than batch learning. Here is a picture from a class on online learning: The point of this picture is that even in 3 dimensions and even with linear constraints, finding the minima of a set in an online fashion can be typically faster than finding the minima in a batch fashion. To see this, note that there is a minimal number of gradient updates (i.e. 2) required in order to reach the minima in the typical case. Given this, it’s best to do these updates as quickly as possible, which implies doing the first update online (i.e. before seeing all the examples) is preferred. Note

6 0.098593056 252 hunch net-2007-07-01-Watchword: Online Learning

7 0.097999342 327 hunch net-2008-11-16-Observations on Linearity for Reductions to Regression

8 0.092677757 385 hunch net-2009-12-27-Interesting things at NIPS 2009

9 0.080832295 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

10 0.080063507 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

11 0.078183673 235 hunch net-2007-03-03-All Models of Learning have Flaws

12 0.076478966 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

13 0.075874448 229 hunch net-2007-01-26-Parallel Machine Learning Problems

14 0.074189261 9 hunch net-2005-02-01-Watchword: Loss

15 0.072959304 128 hunch net-2005-11-05-The design of a computing cluster

16 0.070010826 111 hunch net-2005-09-12-Fast Gradient Descent

17 0.068919733 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

18 0.068554327 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

19 0.068393469 67 hunch net-2005-05-06-Don’t mix the solution into the problem

20 0.06812083 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.146), (1, 0.084), (2, 0.005), (3, -0.046), (4, 0.035), (5, 0.009), (6, -0.095), (7, -0.041), (8, -0.03), (9, 0.088), (10, -0.004), (11, -0.035), (12, -0.011), (13, -0.059), (14, 0.061), (15, 0.041), (16, -0.027), (17, -0.052), (18, 0.009), (19, 0.034), (20, -0.01), (21, 0.006), (22, 0.022), (23, 0.054), (24, 0.028), (25, 0.024), (26, 0.031), (27, 0.017), (28, -0.018), (29, 0.008), (30, -0.034), (31, -0.023), (32, -0.012), (33, 0.013), (34, -0.056), (35, 0.015), (36, 0.043), (37, 0.039), (38, -0.048), (39, 0.028), (40, 0.006), (41, 0.029), (42, 0.036), (43, 0.011), (44, -0.142), (45, -0.052), (46, -0.041), (47, 0.04), (48, -0.011), (49, 0.025)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94751722 308 hunch net-2008-07-06-To Dual or Not


2 0.68382984 258 hunch net-2007-08-12-Exponentiated Gradient

Introduction: The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it. The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n,…,1/n) (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens: The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts). EG makes a prediction according to y’ = w . x (dot product). The world reve

3 0.66306287 267 hunch net-2007-10-17-Online as the new adjective

Introduction: Online learning is in vogue, which means we should expect to see in the near future: Online boosting. Online decision trees. Online SVMs. (actually, we’ve already seen) Online deep learning. Online parallel learning. etc… There are three fundamental drivers of this trend. Increasing size of datasets makes online algorithms attractive. Online learning can simply be more efficient than batch learning. Here is a picture from a class on online learning: The point of this picture is that even in 3 dimensions and even with linear constraints, finding the minima of a set in an online fashion can be typically faster than finding the minima in a batch fashion. To see this, note that there is a minimal number of gradient updates (i.e. 2) required in order to reach the minima in the typical case. Given this, it’s best to do these updates as quickly as possible, which implies doing the first update online (i.e. before seeing all the examples) is preferred. Note

4 0.63975757 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is gr
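A minimal sketch of the hashing trick described above (my own illustration, not Vowpal Wabbit code; the choice of md5 and 2^18 bins is arbitrary):

import hashlib

def hashed_features(tokens, n_bins=2**18):
    """Map raw tokens straight to indices in a fixed-size feature space via a hash.
    Collisions lose information, which is the concern the birthday paradox quantifies."""
    counts = {}
    for tok in tokens:
        # any stable hash works; md5 is used here only for reproducibility across runs
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % n_bins
        counts[idx] = counts.get(idx, 0) + 1
    return counts  # sparse representation: {feature index: count}

# usage: features come directly from text, with no vocabulary built in advance
print(hashed_features("to dual or not to dual".split()))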

5 0.62747502 252 hunch net-2007-07-01-Watchword: Online Learning

Introduction: It turns out that many different people use the term “Online Learning”, and often they don’t have the same definition in mind. Here’s a list of the possibilities I know of. Online Information Setting Online learning refers to a problem in which unlabeled data comes, a prediction is made, and then feedback is acquired. Online Adversarial Setting Online learning refers to algorithms in the Online Information Setting which satisfy guarantees of the form: “For all possible sequences of observations, the algorithm has regret at most log(number of strategies) with respect to the best strategy in a set.” This is sometimes called online learning with experts. Online Optimization Constraint Online learning refers to optimizing a predictor via a learning algorithm that tunes parameters on a per-example basis. This may or may not be applied in the Online Information Setting, and the strategy may or may not satisfy Adversarial setting theory. Online Computational Constra

6 0.61259407 385 hunch net-2009-12-27-Interesting things at NIPS 2009

7 0.6056779 348 hunch net-2009-04-02-Asymmophobia

8 0.57919985 109 hunch net-2005-09-08-Online Learning as the Mathematics of Accountability

9 0.55602676 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

10 0.54533666 185 hunch net-2006-06-16-Regularization = Robustness

11 0.54298884 346 hunch net-2009-03-18-Parallel ML primitives

12 0.51736623 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

13 0.51631331 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

14 0.51552153 186 hunch net-2006-06-24-Online convex optimization at COLT

15 0.5153802 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

16 0.51409191 451 hunch net-2011-12-13-Vowpal Wabbit version 6.1 & the NIPS tutorial

17 0.51368368 6 hunch net-2005-01-27-Learning Complete Problems

18 0.51280159 253 hunch net-2007-07-06-Idempotent-capable Predictors

19 0.51245838 179 hunch net-2006-05-16-The value of the orthodox view of Boosting

20 0.50032437 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(27, 0.76), (38, 0.047), (51, 0.015), (53, 0.026), (55, 0.02), (95, 0.015)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99929577 308 hunch net-2008-07-06-To Dual or Not


2 0.99807155 274 hunch net-2007-11-28-Computational Consequences of Classification

Introduction: In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations. In active learning it is sometimes possible to find an e error classifier with just log(1/e) labeled samples. Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter spaces of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints

3 0.99698895 247 hunch net-2007-06-14-Interesting Papers at COLT 2007

Introduction: Here are two papers that seem particularly interesting at this year’s COLT. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily ch

4 0.99682778 166 hunch net-2006-03-24-NLPers

Introduction: Hal Daume has started the NLPers blog to discuss learning for language problems.

5 0.99682778 246 hunch net-2007-06-13-Not Posting

Introduction: If you have been disappointed by the lack of a post for the last month, consider contributing your own (I’ve been busy+uninspired). Also, keep in mind that there is a community of machine learning blogs (see the sidebar).

6 0.99682778 418 hunch net-2010-12-02-Traffic Prediction Problem

7 0.99445659 400 hunch net-2010-06-13-The Good News on Exploration and Learning

8 0.99445331 288 hunch net-2008-02-10-Complexity Illness

9 0.99364501 172 hunch net-2006-04-14-JMLR is a success

10 0.99282664 245 hunch net-2007-05-12-Loss Function Semantics

11 0.98845142 45 hunch net-2005-03-22-Active learning

12 0.98173159 9 hunch net-2005-02-01-Watchword: Loss

13 0.97768801 352 hunch net-2009-05-06-Machine Learning to AI

14 0.97670346 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification

15 0.96888512 304 hunch net-2008-06-27-Reviewing Horror Stories

16 0.96711236 196 hunch net-2006-07-13-Regression vs. Classification as a Primitive

17 0.95503825 483 hunch net-2013-06-10-The Large Scale Learning class notes

18 0.94892031 244 hunch net-2007-05-09-The Missing Bound

19 0.94219232 293 hunch net-2008-03-23-Interactive Machine Learning

20 0.93952584 133 hunch net-2005-11-28-A question of quantification