hunch_net hunch_net-2008 hunch_net-2008-298 knowledge-graph by maker-knowledge-mining

298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features


meta info for this blog

Source: html

Introduction: I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down). A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. It turns out that redundancy is great for dealing with collisions.
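
As a concrete illustration of the trick, here is a minimal sketch of hashing-trick learning in Python. It is not Vowpal Wabbit's actual hash function, table size, or update rule; the 2^18-entry table, MD5-based hash, and squared-loss update are arbitrary choices made for the example.

```python
# Minimal sketch of the hashing trick: map each token straight to an index in a
# fixed-size weight array, with no dictionary and no preprocessing pass.
# (Illustrative only -- not Vowpal Wabbit's actual hash, table size, or update.)
import hashlib

NUM_BITS = 18
N = 1 << NUM_BITS                  # index space of size n = 2^18 (arbitrary here)
weights = [0.0] * N

def feature_index(token: str) -> int:
    """Hash a token into [0, N); collisions are simply tolerated."""
    h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
    return h % N

def predict(tokens) -> float:
    """Linear prediction over hashed (binary) features."""
    return sum(weights[feature_index(t)] for t in tokens)

def online_update(tokens, label: float, lr: float = 0.1) -> None:
    """One online gradient step for squared loss on a single example."""
    err = predict(tokens) - label
    for t in tokens:
        weights[feature_index(t)] -= lr * err
```

Because the index is computed on the fly, each example can be consumed in a single online pass, which is exactly the property the preprocessing and hashtable alternatives destroy.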


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text. [sent-1, score-0.274]

2 The central trick is converting a word (or any other parseable quantity) into a number via a hash function. [sent-2, score-1.005]

3 A central concern for this approach is collisions, which create a loss of information. [sent-4, score-0.246]

4 If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5. [sent-5, score-0.699]

5 This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table. [sent-7, score-0.446]
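
For reference, the back-of-envelope version of that bound (a standard birthday-paradox estimate, reconstructed here rather than quoted from the post):

```latex
% m features hashed into a table of size n: union bound over the ~ m^2/2 pairs,
% each of which collides with probability 1/n.
\Pr[\exists\ \text{a collision}] \;\le\; \binom{m}{2}\frac{1}{n} \;\approx\; \frac{m^{2}}{2n}
% This becomes constant once m \approx n^{1/2}, so avoiding collisions outright
% requires n \approx m^{2}: with m = 10^{5} features, n \approx 10^{10} entries.
```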

6 It turns out that redundancy is great for dealing with collisions. [sent-8, score-0.217]

7 Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. [sent-9, score-0.869]
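
A sketch of that duplication scheme, under the same illustrative assumptions as the earlier Python fragment (the "#1", "#2" suffixes are arbitrary markers, not anything prescribed by the post):

```python
# Sketch of explicit duplication: hash each word several times with a
# distinguishing suffix, so it owns `degree` cells of the weight array.
import hashlib

N = 1 << 18                        # table size (arbitrary for the example)
weights = [0.0] * N

def feature_index(token: str) -> int:
    h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "little")
    return h % N

def duplicated_indices(token: str, degree: int = 2) -> list[int]:
    """Indices for the base word plus degree-1 suffixed copies."""
    return [feature_index(token if i == 0 else f"{token}#{i}") for i in range(degree)]

# A word's information is lost only if every one of its copies lands on a cell
# used by some other word, which is far less likely than a single collision.
```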

8 We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? [sent-10, score-0.69]

9 Plugging in numbers, we see that this implies perhaps only n=10^8 entries are required to avoid a collision. [sent-12, score-0.346]
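
A rough reconstruction of that calculation (my own heuristic, hedged accordingly, not the authors' exact analysis):

```latex
% With two copies per word, a word is lost only if both of its cells are also hit
% by cells belonging to other words. Each cell is hit with probability about 2m/n,
% so the per-word failure probability is roughly (2m/n)^2; union bound over m words:
\Pr[\text{some word loses both copies}] \;\lesssim\; m\left(\frac{2m}{n}\right)^{2} \;=\; \frac{4m^{3}}{n^{2}}
% Keeping this below a constant needs n on the order of m^{3/2}: for m = 10^{5},
% that is a few times 10^{7}, i.e. roughly 10^{8} entries instead of 10^{10}.
```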

10 This number can be further reduced to 10^7 by increasing the degree of duplication to 4 or more. [sent-13, score-0.343]
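
The same heuristic extends to a general duplication degree d (again a back-of-envelope sketch, not a claim from the post):

```latex
% With d copies per word, all d cells must be hit, so the per-word failure
% probability is roughly (dm/n)^d, and the union bound over m words gives
\Pr[\text{some word loses all } d \text{ copies}] \;\lesssim\; m\left(\frac{dm}{n}\right)^{d}
% which stays small once n \approx d\,m^{1+1/d}. For d = 4 and m = 10^{5} this is
% about 4 \times 10^{6.25} \approx 10^{7} entries, matching the figure above.
```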

11 The above is an analysis of explicit duplication. [sent-14, score-0.126]

12 In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions. [sent-15, score-0.177]

13 This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collision, even if the optimal predictor is unchanged. [sent-16, score-0.303]

14 To think about this possibility, analysis particular to specific learning algorithms is necessary. [sent-17, score-0.126]

15 It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms. [sent-18, score-0.298]

16 (*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function. [sent-19, score-0.71]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('entries', 0.346), ('hash', 0.27), ('word', 0.26), ('collision', 0.216), ('collisions', 0.216), ('features', 0.183), ('central', 0.162), ('trick', 0.146), ('analysis', 0.126), ('turns', 0.109), ('lookup', 0.108), ('redundancy', 0.108), ('duplication', 0.108), ('slowing', 0.108), ('kishore', 0.108), ('tolerant', 0.108), ('optimal', 0.104), ('birthday', 0.1), ('paradox', 0.1), ('symbol', 0.1), ('vocabulary', 0.1), ('index', 0.1), ('tolerance', 0.1), ('describes', 0.1), ('slowed', 0.095), ('theoretic', 0.095), ('expand', 0.09), ('converting', 0.09), ('plugging', 0.087), ('randomization', 0.087), ('land', 0.087), ('expensive', 0.087), ('create', 0.084), ('corresponding', 0.084), ('hashing', 0.084), ('nlp', 0.084), ('entry', 0.084), ('tricks', 0.084), ('function', 0.083), ('increasing', 0.081), ('margin', 0.081), ('array', 0.079), ('number', 0.077), ('implicitly', 0.077), ('alex', 0.077), ('reduced', 0.077), ('tells', 0.073), ('quantity', 0.071), ('possibility', 0.071), ('online', 0.069)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features


2 0.12791106 129 hunch net-2005-11-07-Prediction Competitions

Introduction: There are two prediction competitions currently in the air. The Performance Prediction Challenge by Isabelle Guyon . Good entries minimize a weighted 0/1 loss + the difference between a prediction of this loss and the observed truth on 5 datasets. Isabelle tells me all of the problems are “real world” and the test datasets are large enough (17K minimum) that the winner should be well determined by ability rather than luck. This is due March 1. The Predictive Uncertainty Challenge by Gavin Cawley . Good entries minimize log loss on real valued output variables for one synthetic and 3 “real” datasets related to atmospheric prediction. The use of log loss (which can be infinite and hence is never convergent) and smaller test sets of size 1K to 7K examples makes the winner of this contest more luck dependent. Nevertheless, the contest may be of some interest particularly to the branch of learning (typically Bayes learning) which prefers to optimize log loss. May the

3 0.099274404 408 hunch net-2010-08-24-Alex Smola starts a blog

Introduction: Adventures in Data Land .

4 0.096842408 259 hunch net-2007-08-19-Choice of Metrics

Introduction: How do we judge success in Machine Learning? As Aaron notes , the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are: The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss. The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity. The correct loss function is simply unknown (and perhaps unknowable, except by experimentation). In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real wor

5 0.093929425 158 hunch net-2006-02-24-A Fundamentalist Organization of Machine Learning

Introduction: There are several different flavors of Machine Learning classes. Many classes are of the ‘zoo’ sort: many different learning algorithms are presented. Others avoid the zoo by not covering the full scope of machine learning. This is my view of what makes a good machine learning class, along with why. I’d like to specifically invite comment on whether things are missing, misemphasized, or misplaced. Phase Subject Why? Introduction What is a machine learning problem? A good understanding of the characteristics of machine learning problems seems essential. Characteristics include: a data source, some hope the data is predictive, and a need for generalization. This is probably best taught in a case study manner: lay out the specifics of some problem and then ask “Is this a machine learning problem?” Introduction Machine Learning Problem Identification Identification and recognition of the type of learning problems is (obviously) a very important step i

6 0.093032345 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

7 0.089871138 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

8 0.089078784 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

9 0.088777587 235 hunch net-2007-03-03-All Models of Learning have Flaws

10 0.083738297 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

11 0.083660826 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

12 0.082853921 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

13 0.082240328 444 hunch net-2011-09-07-KDD and MUCMD 2011

14 0.082064651 267 hunch net-2007-10-17-Online as the new adjective

15 0.081952482 332 hunch net-2008-12-23-Use of Learning Theory

16 0.08134789 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge

17 0.078951746 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

18 0.077262595 258 hunch net-2007-08-12-Exponentiated Gradient

19 0.07615371 281 hunch net-2007-12-21-Vowpal Wabbit Code Release

20 0.076102518 341 hunch net-2009-02-04-Optimal Proxy Loss for Classification


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.183), (1, 0.101), (2, -0.018), (3, -0.008), (4, 0.01), (5, 0.019), (6, -0.096), (7, -0.013), (8, 0.008), (9, 0.036), (10, -0.042), (11, -0.032), (12, 0.026), (13, -0.04), (14, -0.017), (15, -0.022), (16, -0.04), (17, -0.015), (18, 0.027), (19, 0.005), (20, 0.007), (21, 0.008), (22, 0.043), (23, -0.004), (24, 0.006), (25, -0.021), (26, 0.055), (27, -0.016), (28, 0.029), (29, -0.035), (30, -0.032), (31, 0.001), (32, 0.05), (33, 0.024), (34, -0.03), (35, 0.008), (36, -0.026), (37, -0.02), (38, -0.045), (39, -0.013), (40, -0.028), (41, 0.141), (42, 0.078), (43, -0.019), (44, -0.035), (45, -0.008), (46, -0.027), (47, 0.003), (48, -0.095), (49, -0.031)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96128631 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features


2 0.66015381 450 hunch net-2011-12-02-Hadoop AllReduce and Terascale Learning

Introduction: Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy. Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation. It is necessary but not sufficient to have an effective communication infrastructure. It is ne

3 0.65978485 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

Introduction: One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations. There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail. In contrast, if you go to a machine learning conference, a large number of the new algorithms are v

4 0.64261961 348 hunch net-2009-04-02-Asymmophobia

Introduction: One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms. feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance. example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach. label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l x, but it occurs in many other places as well. Empirically, breaking symmetry well seems to yield great algorithms. feature asymmetry For those who like t

5 0.62064028 398 hunch net-2010-05-10-Aggregation of estimators, sparsity in high dimension and computational feasibility

Introduction: (I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.) Since Nemirovski’s Saint Flour lecture notes , numerous researchers have studied the following problem in least squares regression: predict as well as (MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions) (C) the best convex combination of these functions (i.e., model = convex hull of the d functions) (L) the best linear combination of these functions (i.e., model = linear span of the d functions) It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample. The practical use of these results seems rather limited to trivial

6 0.61982399 308 hunch net-2008-07-06-To Dual or Not

7 0.61855793 419 hunch net-2010-12-04-Vowpal Wabbit, version 5.0, and the second heresy

8 0.60052425 253 hunch net-2007-07-06-Idempotent-capable Predictors

9 0.59079891 262 hunch net-2007-09-16-Optimizing Machine Learning Programs

10 0.58338398 365 hunch net-2009-07-31-Vowpal Wabbit Open Source Project

11 0.58171833 444 hunch net-2011-09-07-KDD and MUCMD 2011

12 0.56113452 152 hunch net-2006-01-30-Should the Input Representation be a Vector?

13 0.55531055 136 hunch net-2005-12-07-Is the Google way the way for machine learning?

14 0.54984468 217 hunch net-2006-11-06-Data Linkage Problems

15 0.54645658 219 hunch net-2006-11-22-Explicit Randomization in Learning algorithms

16 0.54483932 397 hunch net-2010-05-02-What’s the difference between gambling and rewarding good prediction?

17 0.53954309 258 hunch net-2007-08-12-Exponentiated Gradient

18 0.53854054 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

19 0.53211558 426 hunch net-2011-03-19-The Ideal Large Scale Learning Class

20 0.52751511 300 hunch net-2008-04-30-Concerns about the Large Scale Learning Challenge


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.473), (10, 0.01), (27, 0.198), (38, 0.044), (53, 0.065), (55, 0.03), (94, 0.093)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.94395113 243 hunch net-2007-05-08-Conditional Tournaments for Multiclass to Binary

Introduction: This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and I. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here: For the single elimination tournament, we can prove that: For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated: reg_multiclass(D, Filter_tree_test(c)) <= reg_binary(Filter_tree_train(D), c) Here: Filter_tree_train(D) is the induced binary classification problem Filter_tree_test(c) is the induced multiclass classifier. reg_multiclass is the multiclass regret (= difference between error rate and minim

same-blog 2 0.92088151 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features


3 0.91487426 401 hunch net-2010-06-20-2010 ICML discussion site

Introduction: A substantial difficulty with the 2009 and 2008 ICML discussion system was a communication vacuum, where authors were not informed of comments, and commenters were not informed of responses to their comments without explicit monitoring. Mark Reid has setup a new discussion system for 2010 with the goal of addressing this. Mark didn’t want to make it to intrusive, so you must opt-in. As an author, find your paper and “Subscribe by email” to the comments. As a commenter, you have the option of providing an email for follow-up notification.

4 0.83864552 213 hunch net-2006-10-08-Incompatibilities between classical confidence intervals and learning.

Introduction: Classical confidence intervals satisfy a theorem of the form: For some data sources D, Pr_{S ~ D}(f(D) > g(S)) > 1-d, where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D”. There are even some confidence intervals which do not require IID samples. Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter. There are a number of results in theoretical machine learn

5 0.80894345 31 hunch net-2005-02-26-Problem: Reductions and Relative Ranking Metrics

Introduction: This, again, is something of a research direction rather than a single problem. There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC , “F1″, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class ( google ‘s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these. Problem What does the ability to classify well imply about performance under these metrics? Past Work Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC. Sample complexity bounds for AROC (and here ). A paper on “ Learning to Order Things “. Difficulty Several of these may be easy. Some of them may be h

6 0.80196369 391 hunch net-2010-03-15-The Efficient Robust Conditional Probability Estimation Problem

7 0.71603233 289 hunch net-2008-02-17-The Meaning of Confidence

8 0.6485517 183 hunch net-2006-06-14-Explorations of Exploration

9 0.60283935 484 hunch net-2013-06-16-Representative Reviewing

10 0.56684411 78 hunch net-2005-06-06-Exact Online Learning for Classification

11 0.5654434 34 hunch net-2005-03-02-Prior, “Prior” and Bias

12 0.55111349 461 hunch net-2012-04-09-ICML author feedback is open

13 0.5358398 227 hunch net-2007-01-10-A Deep Belief Net Learning Problem

14 0.51875716 259 hunch net-2007-08-19-Choice of Metrics

15 0.51640594 129 hunch net-2005-11-07-Prediction Competitions

16 0.51321846 41 hunch net-2005-03-15-The State of Tight Bounds

17 0.51248753 282 hunch net-2008-01-06-Research Political Issues

18 0.51247722 67 hunch net-2005-05-06-Don’t mix the solution into the problem

19 0.5058164 40 hunch net-2005-03-13-Avoiding Bad Reviewing

20 0.50529486 5 hunch net-2005-01-26-Watchword: Probability