fast_ml fast_ml-2013 fast_ml-2013-20 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use all of them. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con
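Since the competition is scored with mean absolute error, here is a minimal sketch of the metric in Python; the values are made up for illustration and are not taken from the competition data.

import numpy as np

def mae(y_true, y_pred):
    # mean absolute error: average of |actual - predicted|
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

print(mae([30000, 45000], [28000, 50000]))  # prints 3500.0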
sentIndex sentText sentNum sentScore
1 This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. [sent-2, score-0.786]
2 More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). [sent-11, score-0.401]
3 The competition is about predicting salaries from job adverts. [sent-12, score-0.649]
4 Code for predicting advertised salaries is available at Github. [sent-19, score-0.608]
5 The distribution of salaries. VW, being a linear learner, probably expects the target variable to have a normal distribution. [sent-20, score-0.654]
6 We checked empirically ;) Therefore, it is important to log-transform salaries, because the original distribution is skewed. After the transform, we get something closer to a bell curve. [sent-21, score-0.701]
7 Another option would be to take the square root of salaries. Validation shows that the log transform works better in this case (see the code sketches after this sentence list). [sent-24, score-0.298]
8 Validation. For validation purposes, we convert the training set file to VW format, then split it randomly into train and validation sets (95% train, 5% validation). [sent-25, score-0.653]
9 By explicit we mean that limiting the number of passes works as regularization. [sent-29, score-0.376]
10 csv file as the test file, because that’s what it is at the moment (there are no labels in there). [sent-31, score-0.221]
11 To produce a submission file we need to prepare the data a little bit, namely convert it to VW format with 2vw. [sent-32, score-0.351]
12 You could just give VW a value of a categorical variable with spaces removed to achieve the same effect, but we had the binarizing code handy. [sent-37, score-0.39]
13 The advantage of employing it (it’s a post about job salaries, isn’t it? [sent-38, score-0.227]
14 We think that it’s best to convert training and test files together, that is: combine them, convert, and then split back. [sent-40, score-0.407]
15 To do this, we need to add dummy salary columns to the test file, so that the columns in both files match. [sent-41, score-0.619]
16 vw 999999999 244768 The third argument to first. [sent-60, score-0.393]
17 Running VW and the result. Now let’s run VW: vw -d data/train. [sent-63, score-0.302]
18 txt Predictions come out on the log scale, so the last step is to convert them back. [sent-66, score-0.283]
19 The result is pretty good. The remarkable thing is that the test score is almost the same as the validation score, which was 7149. [sent-69, score-0.227]
20 This means that the train and test files come from the same distribution - a good thing for predicting. [sent-70, score-0.376]
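The sentences above describe the pipeline only in fragments, so here are a few hedged sketches of the steps they mention. Everything below is illustrative: the file names, the column names (SalaryNormalized, FullDescription and so on) and the particular VW settings are assumptions about the setup, not code from the original post. First, the target transform and the random 95/5 split of the VW-format training file:

import numpy as np
import random

# log-transform the target; the square-root alternative is shown for comparison,
# and np.exp inverts the log transform once predictions come back
salaries = np.array([18000.0, 25000.0, 42000.0, 95000.0])  # made-up values
log_salaries = np.log(salaries)
sqrt_salaries = np.sqrt(salaries)

# random 95% / 5% split of a VW-format file into train and validation parts
random.seed(0)
with open('data/train.vw') as f_in, \
        open('data/train_split.vw', 'w') as f_train, \
        open('data/valid_split.vw', 'w') as f_valid:
    for line in f_in:
        (f_valid if random.random() < 0.05 else f_train).write(line)

Next, converting a CSV row to VW format, with the binarizing trick for categorical values (spaces replaced so that each category becomes a single token). The post pads the test CSV with dummy salary columns before combining the files; in this sketch a dummy label is simply written at conversion time instead:

import csv
import math

def categorical_feature(name, value):
    # one binary feature per category, e.g. LocationNormalized_London_South_East
    return name + '_' + value.replace(' ', '_')

def row_to_vw(row, is_test=False):
    # dummy label for the test file, log salary for the training file
    label = 1.0 if is_test else math.log(float(row['SalaryNormalized']))
    words = ' '.join(w for w in row['FullDescription'].lower().split() if w.isalpha())
    cats = ' '.join(categorical_feature(c, row[c]) for c in ('Category', 'LocationNormalized'))
    return '{} | {} {}'.format(label, words, cats)

with open('data/train.csv') as f:
    print(row_to_vw(next(csv.DictReader(f))))

Finally, running VW and converting the predictions back from the log scale. The calls are driven from Python here; -d, --cache_file, --passes, -f, -t, -i and -p are standard VW options, but the pass count and paths are placeholders:

import math
import subprocess

# train with a few passes (multiple passes require a cache file) and save the model
subprocess.check_call(['vw', '-d', 'data/train_split.vw', '--cache_file', 'data/cache',
                       '--passes', '10', '-f', 'data/model.vw'])
# score the test file without learning and write raw predictions
subprocess.check_call(['vw', '-d', 'data/test.vw', '-t', '-i', 'data/model.vw',
                       '-p', 'data/predictions.txt'])

# predictions are log salaries, so exponentiate to get back to actual figures
with open('data/predictions.txt') as f_in, open('data/salaries.txt', 'w') as f_out:
    for line in f_in:
        f_out.write('{:.2f}\n'.format(math.exp(float(line.split()[0]))))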
wordName wordTfidf (topN-words)
[('salaries', 0.376), ('vw', 0.302), ('distribution', 0.202), ('binarizing', 0.162), ('passes', 0.161), ('transform', 0.161), ('job', 0.16), ('convert', 0.146), ('validation', 0.142), ('log', 0.137), ('file', 0.136), ('explicit', 0.135), ('cleverer', 0.135), ('columns', 0.129), ('advertised', 0.119), ('salary', 0.119), ('predicting', 0.113), ('text', 0.096), ('argument', 0.091), ('files', 0.089), ('split', 0.087), ('refer', 0.085), ('categorical', 0.085), ('test', 0.085), ('mean', 0.08), ('variable', 0.076), ('format', 0.069), ('forest', 0.069), ('add', 0.068), ('beats', 0.067), ('merck', 0.067), ('removed', 0.067), ('sounds', 0.067), ('ads', 0.067), ('appear', 0.067), ('bell', 0.067), ('ceiling', 0.067), ('checked', 0.067), ('congratulations', 0.067), ('describing', 0.067), ('empirically', 0.067), ('employing', 0.067), ('intuitive', 0.067), ('saying', 0.067), ('spotted', 0.067), ('starts', 0.067), ('algorithm', 0.064), ('course', 0.064), ('model', 0.062), ('back', 0.061)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999958 20 fast ml-2013-02-18-Predicting advertised salaries
2 0.27332976 26 fast ml-2013-04-17-Regression as classification
Introduction: An interesting development occurred in Job salary prediction at Kaggle: the guy who ranked 3rd used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment. The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man, used 30 bins. We like a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples in the high end, we stop at 12.0, so that gives us 36 bins. Here’s the code:
import numpy as np
min_salary = 8.5
max_salary = 12.0
interval = 0.1
a_range = np.arange( min_salary, max_salary + interval, interval )
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    class_mapping[n] = i + 1
This way we get a mapping from log salaries to classes. Class labels start with 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be
3 0.23419137 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.
4 0.18825057 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
Introduction: Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are: a neural network with a single hidden layer; automatic creation of polynomial, specifically quadratic and cubic, features; and N-grams. We describe how to use them, providing examples from the Kaggle Amazon competition and for the kin8nm dataset. Neural network. The original motivation for creating neural network code in VW was to win some Kaggle competitions using only vee-dub, and that goal becomes much more feasible once you have a strong non-linear learner. The network seems to be a classic multi-layer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout. Here’s an example of how to create a network with 10 hidden units: vw -d data.vw --nn 10 Quadratic and cubic features. The idea of quadratic features is to create all possible combinations between original features, so that
5 0.17526704 30 fast ml-2013-06-01-Amazon aspires to automate access control
Introduction: This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily and with AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit. Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives when compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs.
1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,17183,1540,117961,118343,123125,118536,118536,308574,118539
1,36724,14457,118219,118220,117884,117879,267952
6 0.17016833 33 fast ml-2013-07-09-Introducing phraug
7 0.16336559 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
8 0.1461945 25 fast ml-2013-04-10-Gender discrimination
9 0.1404155 8 fast ml-2012-10-15-Merck challenge
10 0.13417998 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
11 0.13249792 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
12 0.126537 32 fast ml-2013-07-05-Processing large files, line by line
13 0.12467095 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
14 0.12307119 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
15 0.11600675 43 fast ml-2013-11-02-Maxing out the digits
16 0.10598504 19 fast ml-2013-02-07-The secret of the big guys
17 0.10055304 13 fast ml-2012-12-27-Spearmint with a random forest
18 0.091340959 40 fast ml-2013-10-06-Pylearn2 in practice
19 0.087798342 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
20 0.085501149 18 fast ml-2013-01-17-A very fast denoising autoencoder
topicId topicWeight
[(0, 0.429), (1, -0.338), (2, -0.146), (3, 0.15), (4, -0.009), (5, 0.006), (6, 0.108), (7, -0.011), (8, 0.054), (9, -0.066), (10, -0.14), (11, 0.071), (12, 0.008), (13, 0.076), (14, -0.041), (15, -0.009), (16, -0.075), (17, 0.029), (18, 0.122), (19, -0.041), (20, 0.187), (21, -0.059), (22, -0.067), (23, -0.074), (24, 0.037), (25, -0.033), (26, 0.103), (27, -0.093), (28, -0.085), (29, -0.142), (30, 0.051), (31, 0.08), (32, -0.11), (33, 0.013), (34, 0.06), (35, -0.063), (36, 0.039), (37, 0.016), (38, 0.142), (39, 0.12), (40, -0.082), (41, -0.137), (42, -0.045), (43, -0.105), (44, 0.076), (45, 0.026), (46, 0.096), (47, -0.052), (48, 0.111), (49, -0.137)]
simIndex simValue blogId blogTitle
same-blog 1 0.96822 20 fast ml-2013-02-18-Predicting advertised salaries
2 0.72959328 26 fast ml-2013-04-17-Regression as classification
3 0.43079349 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
4 0.41633236 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
5 0.41168591 25 fast ml-2013-04-10-Gender discrimination
Introduction: There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not? Congratulations! You have spotted the ceiling cat. As Sashi noticed on the forums, it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story. Let’s start with normalizing. There’s a nice function for that in R, scale(). The dataset is small, 1128 examples, so we can go ahead and use R. Turns out that in its raw form, the data won’t scale. That’s because apparently there are some columns with zeros only. That makes it difficult to divide. Fortunately, we know just the right tool for the task. We learned about it on Kaggle forums too. It’s a function in caret package called nearZeroVar(). It will give you indexes of all the columns which have near zero var
6 0.38666427 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
7 0.36904794 8 fast ml-2012-10-15-Merck challenge
8 0.34806931 33 fast ml-2013-07-09-Introducing phraug
9 0.32091513 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
10 0.31739488 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
11 0.31285745 32 fast ml-2013-07-05-Processing large files, line by line
12 0.30158323 43 fast ml-2013-11-02-Maxing out the digits
13 0.29417661 30 fast ml-2013-06-01-Amazon aspires to automate access control
14 0.28965738 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
15 0.2685163 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
16 0.2652618 19 fast ml-2013-02-07-The secret of the big guys
17 0.25723931 13 fast ml-2012-12-27-Spearmint with a random forest
18 0.23513724 35 fast ml-2013-08-12-Accelerometer Biometric Competition
19 0.22588596 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
20 0.21980101 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
topicId topicWeight
[(6, 0.016), (26, 0.073), (31, 0.032), (35, 0.012), (48, 0.016), (50, 0.397), (55, 0.081), (58, 0.024), (69, 0.169), (71, 0.035), (79, 0.011), (81, 0.025), (99, 0.04)]
simIndex simValue blogId blogTitle
same-blog 1 0.84654194 20 fast ml-2013-02-18-Predicting advertised salaries
2 0.4418394 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
Introduction: The job salary prediction contest at Kaggle offers a high-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples. We present a way to select a few thousand relevant features using L1 (Lasso) regularization. A linear model seems to work just as well with those selected features as with the full set. This means we get roughly 40 times fewer features for a much more manageable, smaller data set. What you wanted to know about Lasso and Ridge. L1 and L2 are both ways of regularization, sometimes called weight decay. Basically, we include parameter weights in a cost function (a toy sketch of these penalties appears at the end of this listing). In effect, the model will try to minimize those weights by going “down the slope”. Example weights: in a linear model or in a neural network. L1 is known as Lasso and L2 is known as Ridge. These names may be confusing, because a chart of Lasso looks like a ridge and a
3 0.40632349 18 fast ml-2013-01-17-A very fast denoising autoencoder
Introduction: Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called marginalized Stacked Denoising Autoencoder and the author claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders, but is orders of magnitude faster. We like all things fast, so we were hooked. About autoencoders. Wikipedia says that an autoencoder is an artificial neural network and its aim is to learn a compressed representation for a set of data. This means it is being used for dimensionality reduction. In other words, an autoencoder is a neural network meant to replicate the input. It would be trivial with a big enough number of units in a hidden layer: the network would just find an identity mapping. Hence dimensionality reduction: a hidden layer size is typically smaller than the input layer. mSDA is a curious specimen: it is not a neural network and it doesn’t reduce dimension
4 0.40390411 27 fast ml-2013-05-01-Deep learning made easy
Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple benchmarks for this competition and the best one is unusually hard to beat - only less than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github, and has one or two very appealing properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers. Geoff Hinton from Toronto talks about two ends
5 0.40283403 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
Introduction: The promise. What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human): data preparation and model tuning. This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into the format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be the number of trees to grow and the number of candidate features at each split (mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid
6 0.39999571 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
7 0.39824647 13 fast ml-2012-12-27-Spearmint with a random forest
8 0.39051425 32 fast ml-2013-07-05-Processing large files, line by line
9 0.38835359 17 fast ml-2013-01-14-Feature selection in practice
10 0.38775444 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
11 0.38648391 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
12 0.38579494 26 fast ml-2013-04-17-Regression as classification
13 0.38319495 40 fast ml-2013-10-06-Pylearn2 in practice
14 0.38295713 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
15 0.38093022 43 fast ml-2013-11-02-Maxing out the digits
16 0.37725171 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
17 0.37649632 25 fast ml-2013-04-10-Gender discrimination
18 0.37256175 9 fast ml-2012-10-25-So you want to work for Facebook
19 0.37061915 35 fast ml-2013-08-12-Accelerometer Biometric Competition
20 0.36855546 19 fast ml-2013-02-07-The secret of the big guys
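To illustrate the Lasso and Ridge remark in the L1 feature selection entry above: regularization adds a weight penalty to the cost being minimized. This is a toy sketch only; the weights and the lambda value are made up and this is not code from any of the posts listed here. (Vowpal Wabbit exposes --l1 and --l2 options for the same idea.)

import numpy as np

def l1_penalty(weights, lam):
    # Lasso: lambda times the sum of absolute weights
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # Ridge: lambda times the sum of squared weights
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -2.0, 0.0, 3.0])  # made-up model weights
data_loss = 1.7                      # made-up unregularized loss
print(data_loss + l1_penalty(w, 0.1))  # 1.7 + 0.55 = 2.25
print(data_loss + l2_penalty(w, 0.1))  # 1.7 + 1.325 = 3.025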