fast_ml-2012-13: Spearmint with a random forest (knowledge graph by maker-knowledge-mining)
Source: html
Introduction: Now that we have Spearmint basics nailed, we'll try tuning a random forest, and specifically two hyperparams: the number of trees (ntrees) and the number of candidate features at each split (mtry). Here's some code.

We're going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression task, as ratings are in the (0, 10) range. We will split the data 80/10/10 into train, validation and test sets, use the first two to establish optimal hyperparams, and then predict on the test set. As an error measure we will use RMSE.
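The post points to its own code; the snippet below is only a rough, hypothetical sketch of the setup (it assumes the UCI winequality-red.csv file, which is semicolon-separated and has a quality column, plus an arbitrary seed):

```r
# A minimal, hypothetical sketch of the setup - not the original code.
# Assumes the UCI red wine file (semicolon-separated, 'quality' column).
library(randomForest)

d <- read.csv("winequality-red.csv", sep = ";")

set.seed(42)
n <- nrow(d)
i <- sample(n)                                  # shuffled row indices

train <- d[i[1:round(0.8 * n)], ]
val   <- d[i[(round(0.8 * n) + 1):round(0.9 * n)], ]
test  <- d[i[(round(0.9 * n) + 1):n], ]

rmse <- function(y, p) sqrt(mean((y - p)^2))
```

With roughly 1600 rows, this gives approximately 1280/160/160 examples in the three sets.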
At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there are eleven features in total, so that's the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote the validation error value: green means RMSE < 0.57, blue means RMSE < 0.58, black means RMSE >= 0.58.
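What Spearmint needs from us is an objective to minimize (it calls it through a wrapper script, as set up in the previous post). The author's wrapper isn't reproduced here; a minimal sketch of the underlying objective, reusing the split and the rmse helper from above (rf_objective is a made-up name), could look like this:

```r
# A sketch of the objective Spearmint minimizes: train a forest with the
# proposed hyperparams and return the validation RMSE (lower is better).
rf_objective <- function(ntrees, mtry) {
  rf <- randomForest(quality ~ ., data = train, ntree = ntrees, mtry = mtry)
  rmse(val$quality, predict(rf, val))
}

# For example, one point in the search space:
rf_objective(ntrees = 100, mtry = 3)
```

Spearmint then proposes (ntrees, mtry) pairs within the 10-200 and 3-11 bounds and keeps track of the returned errors.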
It turns out that some differences in the error value, even though not that big, are present, so a little effort to choose good hyperparams makes sense. The number of trees is on the horizontal axis and the number of candidate features on the vertical. As you can see, the two runs are pretty similar. The number of trees doesn't matter much as long as it is big enough - good results can be achieved with fewer than a hundred trees. However, the mtry parameter is clearly very important - most of the good runs used a value of three. This is close to the default setting, which is the square root of the number of features (about 3.32 here). If three is so good, we might try setting the lower bound to two or even one.
Here is the plot of the third, final run with 141 tries. It seems that two is even better for mtry than three, and indeed, the best combination found is 158 trees with mtry = 2. Random forests are intrinsically random, so in another run with these settings you will probably get a worse result than that, or a lower error with a different number of trees. Let's see how the error depends on mtry (we omit a few outliers to get a close-up): the best value for mtry is probably two, with three being a runner-up.
So maybe we discovered a better-than-default value for this hyperparam after all. To verify this, we will train ten random forests with the default setting and ten forests with mtry = 2, and compare their validation errors. If you'd like to learn more about the impact of mtry on random forest accuracy, there's a whole paper on the subject by Bernard et al.
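A sketch of that ten-versus-ten comparison (an approximation, not the author's script; fixing ntree at 158 for both settings is my assumption):

```r
# A sketch of the verification experiment, not the author's exact script.
# ntree is fixed at 158 (the best found above) for both settings to isolate
# the effect of mtry - an assumption, as the post doesn't pin this down.
errors_default <- replicate(10, {
  rf <- randomForest(quality ~ ., data = train, ntree = 158)   # default mtry
  rmse(val$quality, predict(rf, val))
})

errors_mtry2 <- replicate(10, {
  rf <- randomForest(quality ~ ., data = train, ntree = 158, mtry = 2)
  rmse(val$quality, predict(rf, val))
})

summary(errors_default)
summary(errors_mtry2)
```

Comparing the two summaries shows whether mtry = 2 really beats the default or whether the difference falls within run-to-run noise.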
Similar posts (by content similarity):

1. fast ml-2013-01-04 - Madelon: Spearmint's revenge
2. fast ml-2012-12-21 - Tuning hyperparams automatically with Spearmint
3. fast ml-2013-01-17 - A very fast denoising autoencoder
4. fast ml-2013-01-12 - Intro to random forests
5. fast ml-2012-12-07 - Predicting wine quality
6. fast ml-2013-12-28 - Regularizing neural networks with dropout and with DropConnect
7. fast ml-2013-02-18 - Predicting advertised salaries
8. fast ml-2013-11-02 - Maxing out the digits
9. fast ml-2013-02-07 - The secret of the big guys
10. fast ml-2014-03-20 - Good representations, distance, metric learning and supervised dimensionality reduction
11. fast ml-2013-03-07 - Choosing a machine learning algorithm
12. fast ml-2013-01-14 - Feature selection in practice
13. fast ml-2014-02-02 - Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
14. fast ml-2013-05-01 - Deep learning made easy
15. fast ml-2014-03-06 - PyBrain - a simple neural networks library in Python
16. fast ml-2013-04-17 - Regression as classification
17. fast ml-2013-04-10 - Gender discrimination
18. fast ml-2013-12-07 - 13 NIPS papers that caught our eye
19. fast ml-2013-06-19 - Go non-linear with Vowpal Wabbit