fast_ml fast_ml-2012 fast_ml-2012-8 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Today it’s about the Merck challenge - let’s beat the benchmark real quick. Not by much, but quick. If you look at the benchmark source code, you’ll notice this:

# data sets 1 and 6 are too large to fit into memory and run basic
# random forest. Sample 20% of data set instead.
if (i == 1 | i == 6) {
    Nrows = length(train[,1])
    train <- train[sample(Nrows, as.integer(0.2*Nrows)),]
}

It means that only 20% of data sets one and six gets used. This suggests an angle of attack, because more data beats a cleverer algorithm [1]. Let’s try our pal VW. We’ll just convert those sets to Vowpal Wabbit format, run training, run prediction, and convert the results to Kaggle format. OK, 0.39, we’re done for the evening. Earlier we tried training a random forest implementation that could take the whole set into memory, but it took maybe an hour to run anyway, and the result of that first attempt wasn’t so good. We’re not into this kind of tempo, so we explored other possibilities.

[1] Pedro Domingos, A Few Useful Things to Know About Machine Learning
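Coming back to that pipeline: here is a minimal sketch of the conversion step, assuming a Kaggle-style CSV layout with a header row, an id column and a numeric target column. The file names, column positions and the csv_to_vw helper are placeholders for illustration, not the actual code used for the contest.

import csv

def csv_to_vw(csv_path, vw_path, id_col=0, label_col=1, train=True):
    # Convert a CSV with a header row into VW's "label | name:value ..." format.
    with open(csv_path) as i_f, open(vw_path, 'w') as o_f:
        reader = csv.reader(i_f)
        header = next(reader)
        for row in reader:
            if train:
                label = row[label_col]
                skip = (id_col, label_col)
            else:
                label = '1'  # VW wants a label; it is ignored at prediction time with -t
                skip = (id_col,)
            feats = ' '.join(
                '%s:%s' % (name, value)
                for col, (name, value) in enumerate(zip(header, row))
                if col not in skip and value not in ('', 'NA')
            )
            o_f.write('%s | %s\n' % (label, feats))

# Placeholder file names - in practice there is one CSV per data set.
csv_to_vw('train1.csv', 'train1.vw', train=True)
csv_to_vw('test1.csv', 'test1.vw', train=False)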
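Training, prediction and building the submission are then a couple of VW calls plus a little glue. Again a hedged sketch: the VW flags shown are a plausible baseline rather than the exact settings used, and the id file and submission header names are assumptions.

import csv
import subprocess

# Train on the converted set, then predict on the test set in test-only mode.
# --passes needs a cache (-c); -f saves the model, -i loads it, -p writes predictions.
subprocess.check_call(['vw', 'train1.vw', '-c', '--passes', '10', '-f', 'model1.vw'])
subprocess.check_call(['vw', 'test1.vw', '-t', '-i', 'model1.vw', '-p', 'preds1.txt'])

# test1_ids.txt is assumed to hold the id column saved off during conversion.
# Pair the ids with VW's predictions to get a Kaggle-format submission.
with open('test1_ids.txt') as ids_f, open('preds1.txt') as p_f, \
        open('submission1.csv', 'w') as o_f:
    writer = csv.writer(o_f)
    writer.writerow(['MOLECULE', 'Prediction'])  # header assumed; match the sample submission
    for mol_id, pred in zip(ids_f, p_f):
        writer.writerow([mol_id.strip(), pred.split()[0]])  # VW prints "prediction [tag]" per line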
wordName wordTfidf (topN-words)
[('nrows', 0.579), ('sets', 0.219), ('memory', 0.18), ('angle', 0.161), ('beats', 0.161), ('cleverer', 0.161), ('explored', 0.161), ('hour', 0.161), ('merck', 0.161), ('pedro', 0.161), ('sample', 0.154), ('earlier', 0.142), ('anyway', 0.142), ('length', 0.142), ('today', 0.142), ('convert', 0.139), ('six', 0.128), ('suggests', 0.128), ('run', 0.118), ('wasn', 0.118), ('fit', 0.109), ('implementation', 0.109), ('took', 0.102), ('basic', 0.096), ('kind', 0.096), ('notice', 0.096), ('attempt', 0.09), ('train', 0.088), ('source', 0.085), ('random', 0.085), ('algorithm', 0.077), ('ok', 0.077), ('prediction', 0.077), ('whole', 0.077), ('beat', 0.073), ('vowpal', 0.073), ('wabbit', 0.073), ('real', 0.073), ('tried', 0.073), ('let', 0.071), ('ll', 0.071), ('challenge', 0.07), ('useful', 0.07), ('done', 0.066), ('large', 0.066), ('benchmark', 0.063), ('re', 0.062), ('format', 0.055), ('forest', 0.055), ('result', 0.055)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 8 fast ml-2012-10-15-Merck challenge
2 0.1404155 20 fast ml-2013-02-18-Predicting advertised salaries
3 0.099793956 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
4 0.064269744 32 fast ml-2013-07-05-Processing large files, line by line
5 0.062899999 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
6 0.062795669 27 fast ml-2013-05-01-Deep learning made easy
7 0.058985438 16 fast ml-2013-01-12-Intro to random forests
8 0.05772128 5 fast ml-2012-09-19-Best Buy mobile contest - big data
9 0.054883767 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
10 0.054356877 30 fast ml-2013-06-01-Amazon aspires to automate access control
11 0.053776111 33 fast ml-2013-07-09-Introducing phraug
12 0.05158091 22 fast ml-2013-03-07-Choosing a machine learning algorithm
13 0.05119919 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
14 0.047757279 19 fast ml-2013-02-07-The secret of the big guys
15 0.046852022 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
16 0.04683305 18 fast ml-2013-01-17-A very fast denoising autoencoder
17 0.046262894 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
18 0.045328155 26 fast ml-2013-04-17-Regression as classification
19 0.044080876 17 fast ml-2013-01-14-Feature selection in practice
20 0.043509707 43 fast ml-2013-11-02-Maxing out the digits
topicId topicWeight
[(0, 0.203), (1, -0.084), (2, -0.031), (3, 0.008), (4, 0.016), (5, 0.045), (6, 0.018), (7, 0.001), (8, 0.027), (9, -0.124), (10, -0.166), (11, -0.013), (12, 0.016), (13, 0.124), (14, -0.098), (15, 0.049), (16, -0.285), (17, -0.07), (18, 0.326), (19, 0.293), (20, 0.03), (21, -0.258), (22, 0.056), (23, -0.454), (24, 0.087), (25, -0.29), (26, 0.061), (27, -0.072), (28, 0.111), (29, 0.067), (30, 0.067), (31, -0.103), (32, 0.217), (33, -0.177), (34, -0.095), (35, -0.01), (36, 0.073), (37, -0.037), (38, 0.055), (39, -0.064), (40, -0.063), (41, -0.038), (42, -0.016), (43, 0.132), (44, -0.106), (45, -0.117), (46, 0.039), (47, -0.03), (48, 0.024), (49, -0.035)]
simIndex simValue blogId blogTitle
same-blog 1 0.98205602 8 fast ml-2012-10-15-Merck challenge
2 0.31685543 20 fast ml-2013-02-18-Predicting advertised salaries
3 0.22938269 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
4 0.15880127 27 fast ml-2013-05-01-Deep learning made easy
5 0.1410097 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
6 0.13910976 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
7 0.12660757 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
8 0.1252367 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
9 0.12298259 17 fast ml-2013-01-14-Feature selection in practice
10 0.12256091 5 fast ml-2012-09-19-Best Buy mobile contest - big data
11 0.12033246 32 fast ml-2013-07-05-Processing large files, line by line
12 0.11987117 43 fast ml-2013-11-02-Maxing out the digits
13 0.1179184 18 fast ml-2013-01-17-A very fast denoising autoencoder
14 0.1138752 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
15 0.11113889 22 fast ml-2013-03-07-Choosing a machine learning algorithm
16 0.10785937 13 fast ml-2012-12-27-Spearmint with a random forest
17 0.10302666 19 fast ml-2013-02-07-The secret of the big guys
18 0.1028391 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
19 0.10270887 16 fast ml-2013-01-12-Intro to random forests
20 0.10142951 25 fast ml-2013-04-10-Gender discrimination
topicId topicWeight
[(31, 0.03), (55, 0.018), (69, 0.166), (71, 0.063), (81, 0.553), (99, 0.045)]
simIndex simValue blogId blogTitle
same-blog 1 0.88312191 8 fast ml-2012-10-15-Merck challenge
2 0.31745574 27 fast ml-2013-05-01-Deep learning made easy
3 0.30210018 9 fast ml-2012-10-25-So you want to work for Facebook
4 0.30094895 20 fast ml-2013-02-18-Predicting advertised salaries
5 0.30048722 13 fast ml-2012-12-27-Spearmint with a random forest
6 0.29428804 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
7 0.29414046 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
8 0.29293481 17 fast ml-2013-01-14-Feature selection in practice
9 0.29155377 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
10 0.29088238 32 fast ml-2013-07-05-Processing large files, line by line
11 0.28290623 16 fast ml-2013-01-12-Intro to random forests
12 0.27818808 18 fast ml-2013-01-17-A very fast denoising autoencoder
13 0.271272 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
14 0.27066931 43 fast ml-2013-11-02-Maxing out the digits
15 0.27024904 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
16 0.27017844 25 fast ml-2013-04-10-Gender discrimination
17 0.26037943 36 fast ml-2013-08-23-A bag of words and a nice little network
18 0.25880426 19 fast ml-2013-02-07-The secret of the big guys
19 0.25754157 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
20 0.255813 35 fast ml-2013-08-12-Accelerometer Biometric Competition