fast_ml fast_ml-2013 fast_ml-2013-23 knowledge-graph by maker-knowledge-mining

23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit


meta info for this blog

Source: html

Introduction: The job salary prediction contest at Kaggle offers a high-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples. We present a way to select a few thousand relevant features using L1 (Lasso) regularization. A linear model seems to work just as well with those selected features as with the full set. This means we get roughly 40 times fewer features and a much more manageable, smaller data set. What you wanted to know about Lasso and Ridge: L1 and L2 are both ways of regularization, sometimes called weight decay. Basically, we include the parameter weights in the cost function. In effect, the model will try to minimize those weights by going “down the slope”. Example weights: the coefficients in a linear model, or the connection weights in a neural network. L1 is known as Lasso and L2 is known as Ridge. These names may be confusing, because a chart of Lasso looks like a ridge and a chart of Ridge looks like a lasso.
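As a reminder of what the two penalties look like, here is a minimal sketch written with a generic loss L(w) over the weight vector w and a regularization strength \lambda; the notation is ours, not VW's exact objective:

    J_{L1}(w) = L(w) + \lambda \sum_i |w_i|   \qquad \text{(Lasso)}

    J_{L2}(w) = L(w) + \lambda \sum_i w_i^2   \qquad \text{(Ridge)}

The L1 term drives many weights exactly to zero, which is what makes it usable for feature selection; the L2 term only shrinks them towards zero.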
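Below is a minimal scikit-learn sketch of the representation and selection steps described above, under our own assumptions: the file name salaries.csv and the column names (Title, FullDescription, Company, and so on) follow the Kaggle data dictionary, and Lasso with a placeholder alpha stands in for the --l1 option of Vowpal Wabbit that the post actually uses. It illustrates the idea of one-hot plus bag-of-words features and L1-driven selection, not the post's exact pipeline:

    # Illustrative sketch only, not the post's VW pipeline:
    # one-hot categoricals + bag-of-words text, then L1-based selection.
    import pandas as pd
    from scipy.sparse import hstack
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    df = pd.read_csv("salaries.csv")                  # hypothetical local copy of the training data
    y = df["SalaryNormalized"].values

    # Categorical columns -> binary indicator features
    cat_cols = ["Company", "Category", "ContractType", "ContractTime", "SourceName"]
    records = df[cat_cols].fillna("missing").astype(str).to_dict(orient="records")
    X_cat = DictVectorizer().fit_transform(records)   # sparse one-hot matrix

    # Text columns -> bag of words
    X_title = CountVectorizer(binary=True).fit_transform(df["Title"].fillna(""))
    X_desc = CountVectorizer(binary=True).fit_transform(df["FullDescription"].fillna(""))

    # One wide sparse matrix; with the real data this is on the order of 240k columns
    X = hstack([X_cat, X_title, X_desc]).tocsr()

    # Non-zero Lasso coefficients mark the selected features; alpha is a placeholder to tune
    selector = SelectFromModel(Lasso(alpha=1.0, max_iter=5000), threshold=1e-6)
    X_selected = selector.fit_transform(X, y)
    print(X.shape, "->", X_selected.shape)

The non-zero coefficients found here play the same role as the non-zero RelScore entries reported by vw-varinfo in the summary below.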


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The job salary prediction contest at Kaggle offers a high-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples. [sent-1, score-0.27]

2 We present a way to select a few thousand relevant features using L1 (Lasso) regularization. [sent-2, score-0.135]

3 A linear model seems to work just as well with those selected features as with the full set. [sent-3, score-0.276]

4 What you wanted to know about Lasso and Ridge: L1 and L2 are both ways of regularization, sometimes called weight decay. [sent-5, score-0.346]

5 These names may be confusing, because a chart of Lasso looks like a ridge and a chart of Ridge looks like a lasso. [sent-10, score-0.467]

6 Take a look at Practical machine learning tricks from the KDD 2011 best industry paper: Throw a ton of features at the model and let L1 sparsity figure it out. Feature representation is a crucial machine learning design decision. [sent-16, score-0.211]

7 They cast a very wide net in terms of representing an ad, including words and topics used in the ad, links to and from the ad landing page, information about the advertiser, and more. [sent-17, score-0.312]

8 Ultimately they rely on strong L1 regularization to enforce sparsity and uncover a limited number of truly relevant features. [sent-18, score-0.249]

9 As you can see, most features come from job title and description (both bag of words) and from company names (categorical). [sent-23, score-0.623]

10 We will filter the features for these columns and leave the others intact. [sent-24, score-0.211]

11 Now we’d like to achieve the same, or at least a similar, result, only with L1 regularization on. [sent-30, score-0.173]

12 We tune, and when we get the settings needed to achieve this result, we run Vowpal Wabbit using a wrapper called vw-varinfo. [sent-32, score-0.181]

13 We delete them by hand in a text editor to retain only the features with non-zero RelScore, and then we use this file as the input for the script converting the data (a scripted version of this filtering step is sketched after this summary). [sent-48, score-0.204]

14 In practice: note that if you want to produce a submission file, as opposed to performing validation, you need to join the train and test files before converting them to VW format. [sent-51, score-0.138]

15 Apparently the amount of L1 regularization needed is rather small: vw -c -k --passes 13 --l1 0. (a placeholder version of this invocation is sketched after this summary) [sent-58, score-0.437]

16 Here’s a sample of output for validation and then for the full set: [VW progress table header: average loss, since last, example counter, example weight, current label, current predict, current features; the numeric progress rows are omitted in this excerpt]. [sent-62, score-0.746]

17 [The same VW progress table header, repeated for the run on the full training set.] [sent-80, score-0.687]

18 It seems that the jobs with “comunally” in the description are likely to pay well, as opposed to jobs with a title that contains the word “apprentice”. [sent-126, score-0.501]

19 We provide the script for transforming the data to libsvm format using only the selected features (a scikit-learn alternative is sketched after this summary). [sent-127, score-0.221]

20 We should finally mention that this format is also known as the svm-light format, for example in scikit-learn. [sent-135, score-0.136]
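Sentences 12 and 13 in the summary above describe running Vowpal Wabbit through the vw-varinfo wrapper and keeping only the features with a non-zero RelScore, which the post does by hand in a text editor. Here is a small sketch of doing that filtering with a script instead; it assumes the wrapper's report sits in a file named varinfo.txt with RelScore as the last whitespace-separated column, formatted like 12.34% (check your own output, the exact layout can differ between versions):

    # Keep only vw-varinfo rows whose RelScore (assumed to be the last column, e.g. "12.34%")
    # is non-zero. File names are placeholders.
    with open("varinfo.txt") as f_in, open("varinfo_nonzero.txt", "w") as f_out:
        header = next(f_in)                  # assumed header row: FeatureName ... RelScore
        f_out.write(header)
        for line in f_in:
            fields = line.split()
            if not fields:
                continue
            relscore = fields[-1].rstrip("%")
            try:
                if abs(float(relscore)) > 0.0:
                    f_out.write(line)
            except ValueError:
                continue                     # not a data row; skip it

The surviving feature names can then be fed to the conversion script mentioned in sentence 13.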
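Sentence 15 quotes the final training command, with the --l1 value truncated in this excerpt. For completeness, a hedged sketch of driving the same kind of run from Python; the file names and the L1 strength are placeholders, and only the flags themselves (-d, -c, -k, --passes, --l1, -f) are standard Vowpal Wabbit options:

    # Placeholder invocation of VW with a small amount of L1 regularization.
    import subprocess

    cmd = [
        "vw",
        "-d", "train.vw",      # placeholder: data already converted to VW format
        "-c", "-k",            # build and use a cache, needed for multiple passes
        "--passes", "13",
        "--l1", "1e-7",        # placeholder strength; the post only says it is "rather small"
        "-f", "model.vw",      # save the trained model here
    ]
    subprocess.run(cmd, check=True)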
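Sentences 19 and 20 mention converting the reduced data to libsvm / svm-light format; the post ships its own conversion script for that. As an alternative sketch, scikit-learn can write the same format directly from a sparse matrix and a target vector (toy data below, just to show the call):

    # Minimal demonstration of svm-light / libsvm output with scikit-learn.
    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.datasets import dump_svmlight_file

    X = csr_matrix(np.array([[0.0, 1.0, 0.0],
                             [2.0, 0.0, 3.0]]))    # toy sparse features
    y = np.array([25000.0, 40000.0])               # toy salary-like targets

    # Each output line looks like: <target> <index>:<value> ... (1-based indices here)
    dump_svmlight_file(X, y, "toy.libsvm", zero_based=False)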


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('lasso', 0.468), ('fulldescription', 0.26), ('relscore', 0.208), ('title', 0.191), ('weight', 0.173), ('regularization', 0.173), ('ad', 0.156), ('current', 0.155), ('ridge', 0.138), ('features', 0.135), ('weights', 0.118), ('analyst', 0.104), ('apprentice', 0.104), ('contracttime', 0.104), ('contracttype', 0.104), ('dorking', 0.104), ('featurename', 0.104), ('hashval', 0.104), ('maxval', 0.104), ('minval', 0.104), ('sourcename', 0.104), ('surrey', 0.104), ('uk', 0.104), ('company', 0.104), ('vw', 0.097), ('jobs', 0.086), ('counter', 0.086), ('engineering', 0.086), ('selected', 0.082), ('category', 0.076), ('leave', 0.076), ('salary', 0.076), ('sparsity', 0.076), ('format', 0.074), ('converting', 0.069), ('description', 0.069), ('systems', 0.069), ('opposed', 0.069), ('looks', 0.069), ('names', 0.065), ('libsvm', 0.065), ('loss', 0.063), ('chart', 0.063), ('needed', 0.063), ('known', 0.062), ('tune', 0.059), ('settings', 0.059), ('bag', 0.059), ('full', 0.059), ('average', 0.055)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit


2 0.20129505 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

3 0.16336559 20 fast ml-2013-02-18-Predicting advertised salaries

Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use all of them. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con

4 0.12234268 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

Introduction: Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are: a neural network with a single hidden layer; automatic creation of polynomial, specifically quadratic and cubic, features; and N-grams. We describe how to use them, providing examples from the Kaggle Amazon competition and for the kin8nm dataset. Neural network: The original motivation for creating neural network code in VW was to win some Kaggle competitions using only vee-dub, and that goal becomes much more feasible once you have a strong non-linear learner. The network seems to be a classic multi-layer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout. Here’s an example of how to create a network with 10 hidden units: vw -d data.vw --nn 10 Quadratic and cubic features: The idea of quadratic features is to create all possible combinations between original features, so that

5 0.11582497 30 fast ml-2013-06-01-Amazon aspires to automate access control

Introduction: This is about Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily and with AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit . Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives when compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs. 1,39353,85475,117961,118300,123472,117905,117906,290919,117908 1,17183,1540,117961,118343,123125,118536,118536,308574,118539 1,36724,14457,118219,118220,117884,117879,267952

6 0.099706687 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

7 0.098444477 26 fast ml-2013-04-17-Regression as classification

8 0.090738975 19 fast ml-2013-02-07-The secret of the big guys

9 0.089571521 33 fast ml-2013-07-09-Introducing phraug

10 0.087891191 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

11 0.083719291 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

12 0.083692372 25 fast ml-2013-04-10-Gender discrimination

13 0.080321185 17 fast ml-2013-01-14-Feature selection in practice

14 0.076533645 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

15 0.071113564 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

16 0.064485945 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

17 0.063661858 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

18 0.063140333 42 fast ml-2013-10-28-How much data is enough?

19 0.062759124 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

20 0.060306076 2 fast ml-2012-08-27-Kaggle job recommendation challenge


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.299), (1, -0.245), (2, -0.027), (3, 0.084), (4, 0.061), (5, 0.027), (6, 0.077), (7, -0.042), (8, 0.084), (9, 0.05), (10, 0.073), (11, 0.029), (12, 0.072), (13, -0.039), (14, -0.089), (15, -0.052), (16, 0.166), (17, 0.244), (18, -0.129), (19, -0.182), (20, 0.04), (21, -0.044), (22, -0.048), (23, -0.079), (24, 0.103), (25, -0.114), (26, -0.133), (27, -0.05), (28, -0.093), (29, -0.019), (30, 0.258), (31, -0.199), (32, 0.078), (33, 0.231), (34, -0.132), (35, 0.114), (36, -0.028), (37, -0.291), (38, 0.023), (39, 0.055), (40, -0.313), (41, 0.253), (42, 0.09), (43, 0.153), (44, 0.065), (45, 0.093), (46, -0.184), (47, 0.026), (48, 0.173), (49, 0.016)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9602986 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit


2 0.37539861 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow


3 0.29595301 20 fast ml-2013-02-18-Predicting advertised salaries


4 0.22318554 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit


5 0.21092461 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise: What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human): data preparation and model tuning. This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into a format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be the number of trees to grow and the number of candidate features at each split (mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

6 0.18829854 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

7 0.17714299 19 fast ml-2013-02-07-The secret of the big guys

8 0.16781282 17 fast ml-2013-01-14-Feature selection in practice

9 0.16142598 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

10 0.15546709 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

11 0.15193444 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

12 0.15065633 25 fast ml-2013-04-10-Gender discrimination

13 0.15016669 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

14 0.14067936 27 fast ml-2013-05-01-Deep learning made easy

15 0.1401695 30 fast ml-2013-06-01-Amazon aspires to automate access control

16 0.13395956 42 fast ml-2013-10-28-How much data is enough?

17 0.13246335 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

18 0.13028431 33 fast ml-2013-07-09-Introducing phraug

19 0.12741832 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

20 0.12687944 26 fast ml-2013-04-17-Regression as classification


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(26, 0.075), (31, 0.06), (35, 0.034), (50, 0.036), (55, 0.047), (56, 0.391), (58, 0.028), (69, 0.12), (71, 0.024), (78, 0.036), (79, 0.023), (99, 0.04)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.83681959 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit


2 0.36158615 20 fast ml-2013-02-18-Predicting advertised salaries


3 0.35402018 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recogni

4 0.34484804 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint


5 0.32945484 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

Introduction: We continue with the CIFAR-10-based competition at Kaggle to get to know DropConnect. It’s supposed to be an improvement over dropout. And dropout is certainly one of the bigger steps forward in neural network development. Is DropConnect really better than dropout? TL;DR: DropConnect seems to offer results similar to dropout. State of the art scores reported in the paper come from model ensembling. Dropout: Dropout, by Hinton et al., is perhaps the biggest invention in the field of neural networks in recent years. It addresses the main problem in machine learning, that is, overfitting. It does so by “dropping out” some unit activations in a given layer, that is setting them to zero. Thus it prevents co-adaptation of units and can also be seen as a method of ensembling many networks sharing the same weights. For each training example a different set of units to drop is randomly chosen. The idea has a biological inspiration. When a child is conceived, it receives half its genes f

6 0.32679179 18 fast ml-2013-01-17-A very fast denoising autoencoder

7 0.32672608 40 fast ml-2013-10-06-Pylearn2 in practice

8 0.32068029 27 fast ml-2013-05-01-Deep learning made easy

9 0.31696224 9 fast ml-2012-10-25-So you want to work for Facebook

10 0.31567523 17 fast ml-2013-01-14-Feature selection in practice

11 0.31194943 13 fast ml-2012-12-27-Spearmint with a random forest

12 0.31186178 43 fast ml-2013-11-02-Maxing out the digits

13 0.31005594 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

14 0.30954465 26 fast ml-2013-04-17-Regression as classification

15 0.30836022 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

16 0.30822608 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

17 0.30721173 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

18 0.30697784 25 fast ml-2013-04-10-Gender discrimination

19 0.30353829 61 fast ml-2014-05-08-Impute missing values with Amelia

20 0.30293781 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview