fast_ml fast_ml-2013 fast_ml-2013-14 knowledge-graph by maker-knowledge-mining

14 fast ml-2013-01-04-Madelon: Spearmint's revenge


meta info for this blog

Source: html

Introduction: Little Spearmint couldn’t sleep that night. “I was so close…”, he was thinking. It seemed that he had found a better-than-default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely the previously mentioned Madelon from the NIPS 2003 Feature Selection Challenge. Among its 500 attributes, only 20 are informative; the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve, or AUC, which was one of the metrics in the original competition.
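For concreteness, here is a minimal sketch of that setup. It is an illustration only: scikit-learn stands in for the R randomForest that the blog tunes with Spearmint, the file names are the ones used by the UCI copy of Madelon, and the 80/20 split ratio and tree count are assumptions, not the author's code.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Madelon: 500 features, only 20 of them informative; labels are -1/+1.
X = np.loadtxt('madelon_train.data')        # 2000 x 500
y = np.loadtxt('madelon_train.labels')
X_test = np.loadtxt('madelon_valid.data')   # original validation set, used here as the test set
y_test = np.loadtxt('madelon_valid.labels')

# Split the original training set into our own train and validation parts.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def auc(model, X_, y_):
    # Area Under the ROC Curve, computed on the predicted probability of the positive class.
    return roc_auc_score(y_, model.predict_proba(X_)[:, 1])

# Baseline: a forest with the default mtry (square root of the number of features);
# the tree count here is arbitrary.
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)
print('validation AUC, default mtry:', auc(rf, X_val, y_val))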


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely the previously mentioned Madelon from the NIPS 2003 Feature Selection Challenge. [sent-5, score-0.928]

2 That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. [sent-7, score-0.62]

3 The dataset consists of train, validation and test parts, with labels being available for train and validation. [sent-8, score-0.335]

4 We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. [sent-9, score-0.723]

5 As an error measure we use Area Under Curve , or AUC, which was one of the metrics in the original competition . [sent-10, score-0.373]

6 0.98 AUC on a test set (training on combined train and validation sets), and it’s still the best! [sent-13, score-0.456]

7 Colors again denote a validation error value (1 - AUC). [sent-14, score-0.55]

8 This time we add red to mark the best results: red : error < 0. [sent-15, score-0.653]

9 There seems to be an area in the vicinity of 200 trees that produces good results. [sent-19, score-0.284]

10 the number of trees they were achieved with. In the second run we’ll explore the space between 500 and 1000 trees. [sent-24, score-0.312]

11 Clearly, mtry around 200 seems best, regardless of the number of trees. [sent-25, score-0.482]

12 Note that Spearmint is very sure about lower mtry values: it hardly explores any settings below 150. [sent-26, score-0.685]

13 As a special treat, we provide a visualization of Spearmint sniffing for low error. [sent-27, score-0.232]

14 Colors are inverted, meaning that red signifies the highest error, blue the lowest. [sent-28, score-0.307]

15 Some high-error points were removed to get a closer look at the error range. [sent-29, score-0.364]

16 mtry, which confirms that an optimal range for mtry is between 150 and 250. The scores: the best result was 0. [sent-33, score-1.043]

17 Let’s check the outcome on a proper validation set. [sent-35, score-0.246]

18 We’ll train two random forests, both with 919 trees, one with mtry = 169 and one with a default value (see the sketch after this list). [sent-36, score-0.643]

19 Subsequently we used combined training and validation sets for training, and scored 1. [sent-47, score-0.605]

20 Zero errors on both training sets might suggest overfitting, especially since there are a lot of people with a higher test score than us, but only a few with zero errors on training. [sent-50, score-0.667]
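The sentences above outline the procedure: Spearmint searches over the number of trees and mtry to minimise validation error (1 - AUC), and the final check (item 18) compares 919 trees at mtry = 169 with a default-mtry forest. A rough sketch of what such a tuning wrapper could look like follows. The main(job_id, params) entry point is the convention used by Spearmint's bundled examples; the scikit-learn calls, the reuse of the X_tr/X_val split from the sketch after the introduction, and the parameter ranges in the comments are illustrative assumptions, not the author's actual script (in a real run the search space would be declared in Spearmint's config file).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def main(job_id, params):
    # Hypothetical Spearmint objective: return the value to minimise, here 1 - validation AUC.
    ntrees = int(params['ntrees'][0])   # e.g. explored between 500 and 1000 in the second run
    mtry = int(params['mtry'][0])       # candidate features per split; optimum reported near 150-250
    rf = RandomForestClassifier(n_estimators=ntrees, max_features=mtry,
                                n_jobs=-1, random_state=0)
    rf.fit(X_tr, y_tr)                  # X_tr, y_tr, X_val, y_val: see the earlier data-loading sketch
    p = rf.predict_proba(X_val)[:, 1]
    return 1.0 - roc_auc_score(y_val, p)

# Final check from item 18: 919 trees, tuned mtry = 169 versus the default (sqrt of 500, i.e. 22).
for max_features in (169, 'sqrt'):
    rf = RandomForestClassifier(n_estimators=919, max_features=max_features,
                                n_jobs=-1, random_state=0)
    rf.fit(X_tr, y_tr)
    print(max_features, roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))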


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mtry', 0.482), ('error', 0.296), ('spearmint', 0.25), ('validation', 0.186), ('sleep', 0.164), ('sniffing', 0.164), ('trees', 0.163), ('red', 0.139), ('auc', 0.131), ('sets', 0.124), ('setting', 0.122), ('area', 0.121), ('combined', 0.121), ('errors', 0.121), ('colors', 0.109), ('scored', 0.109), ('blue', 0.1), ('default', 0.098), ('test', 0.086), ('lower', 0.081), ('higher', 0.081), ('achieved', 0.081), ('best', 0.079), ('original', 0.077), ('hyperparams', 0.077), ('found', 0.071), ('lot', 0.069), ('next', 0.069), ('high', 0.069), ('bear', 0.068), ('explore', 0.068), ('denote', 0.068), ('decision', 0.068), ('highest', 0.068), ('histogram', 0.068), ('home', 0.068), ('informative', 0.068), ('inspecting', 0.068), ('mixed', 0.068), ('preparing', 0.068), ('removed', 0.068), ('submitted', 0.068), ('subsequently', 0.068), ('visualization', 0.068), ('training', 0.065), ('train', 0.063), ('note', 0.062), ('evaluate', 0.06), ('progress', 0.06), ('proper', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000004 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better than default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely previously mentioned Madelon from NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of a train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve , or AUC, which was

2 0.58491415 13 fast ml-2012-12-27-Spearmint with a random forest

Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees ( ntrees ) and a number of candidate features at each split ( mtry ). Here’s some code . We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in (0,10) range. We will split the data 80/10/10 into train, validation and test set, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there’s eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green : RMSE < 0.57 blue : RMSE < 0.58 black : RMSE >= 0.58 Turns out that some diffe

3 0.28218105 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

4 0.15769172 18 fast ml-2013-01-17-A very fast denoising autoencoder

Introduction: Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called marginalized Stacked Denoising Autoencoder and the author claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders, but is orders of magnitudes faster. We like all things fast, so we were hooked. About autoencoders Wikipedia says that an autoencoder is an artificial neural network and its aim is to learn a compressed representation for a set of data. This means it is being used for dimensionality reduction . In other words, an autoencoder is a neural network meant to replicate the input. It would be trivial with a big enough number of units in a hidden layer: the network would just find an identity mapping. Hence dimensionality reduction: a hidden layer size is typically smaller than input layer. mSDA is a curious specimen: it is not a neural network and it doesn’t reduce dimension

5 0.14721784 16 fast ml-2013-01-12-Intro to random forests

Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. as the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfit. Gradient boosted trees (notably GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes

6 0.13454224 43 fast ml-2013-11-02-Maxing out the digits

7 0.13225283 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

8 0.12467095 20 fast ml-2013-02-18-Predicting advertised salaries

9 0.1047406 17 fast ml-2013-01-14-Feature selection in practice

10 0.10197529 39 fast ml-2013-09-19-What you wanted to know about AUC

11 0.094606578 19 fast ml-2013-02-07-The secret of the big guys

12 0.091186486 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

13 0.082263641 25 fast ml-2013-04-10-Gender discrimination

14 0.076927893 26 fast ml-2013-04-17-Regression as classification

15 0.075222701 35 fast ml-2013-08-12-Accelerometer Biometric Competition

16 0.06519901 32 fast ml-2013-07-05-Processing large files, line by line

17 0.064532138 61 fast ml-2014-05-08-Impute missing values with Amelia

18 0.062899999 8 fast ml-2012-10-15-Merck challenge

19 0.062107485 27 fast ml-2013-05-01-Deep learning made easy

20 0.056812715 22 fast ml-2013-03-07-Choosing a machine learning algorithm


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.367), (1, 0.322), (2, -0.641), (3, -0.097), (4, -0.002), (5, 0.012), (6, 0.071), (7, 0.009), (8, 0.094), (9, -0.008), (10, 0.028), (11, 0.087), (12, -0.01), (13, 0.04), (14, 0.121), (15, 0.103), (16, -0.073), (17, 0.007), (18, -0.021), (19, 0.014), (20, 0.016), (21, 0.017), (22, 0.013), (23, 0.02), (24, 0.067), (25, 0.014), (26, 0.004), (27, -0.016), (28, -0.053), (29, 0.018), (30, -0.003), (31, -0.062), (32, -0.025), (33, 0.068), (34, -0.042), (35, -0.027), (36, 0.003), (37, 0.042), (38, 0.048), (39, -0.046), (40, 0.077), (41, -0.082), (42, -0.034), (43, 0.091), (44, -0.029), (45, 0.053), (46, -0.1), (47, -0.043), (48, 0.015), (49, 0.059)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97785813 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better than default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely previously mentioned Madelon from NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of a train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve , or AUC, which was

2 0.95835936 13 fast ml-2012-12-27-Spearmint with a random forest

Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees ( ntrees ) and a number of candidate features at each split ( mtry ). Here’s some code . We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in (0,10) range. We will split the data 80/10/10 into train, validation and test set, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there’s eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green : RMSE < 0.57 blue : RMSE < 0.58 black : RMSE >= 0.58 Turns out that some diffe

3 0.54385084 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

4 0.31926489 43 fast ml-2013-11-02-Maxing out the digits

Introduction: Recently we’ve been investigating the basics of Pylearn2 . Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. Maxout explained If you’ve been following developments in deep learning, you know that Hinton’s most recent recommendation for supervised learning, after a few years of bashing backpropagation in favour of unsupervised pretraining, is to use classic multilayer perceptrons with dropout and rectified linear units. For us, this breath of simplicity is a welcome change. Rectified linear is f(x) = max( 0, x ) . This makes backpropagation trivial: for x > 0, the derivative is one, else zero. Note that ReLU consists of two linear functions. But why stop at two? Let’s take max. out of three, or four, or five linear functions… And so maxout is a generalization of ReLU. It can approximate any convex function. Now backpropagation is easy and dropout prevents overfitting, so we can train a deep

5 0.28645065 16 fast ml-2013-01-12-Intro to random forests

Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. as the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfit. Gradient boosted trees (notably GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes

6 0.27012593 20 fast ml-2013-02-18-Predicting advertised salaries

7 0.26824215 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

8 0.26112205 18 fast ml-2013-01-17-A very fast denoising autoencoder

9 0.23623253 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

10 0.21619643 17 fast ml-2013-01-14-Feature selection in practice

11 0.1818644 25 fast ml-2013-04-10-Gender discrimination

12 0.18111801 35 fast ml-2013-08-12-Accelerometer Biometric Competition

13 0.17900552 26 fast ml-2013-04-17-Regression as classification

14 0.17531748 39 fast ml-2013-09-19-What you wanted to know about AUC

15 0.16775972 61 fast ml-2014-05-08-Impute missing values with Amelia

16 0.16462693 19 fast ml-2013-02-07-The secret of the big guys

17 0.16275081 8 fast ml-2012-10-15-Merck challenge

18 0.16103381 27 fast ml-2013-05-01-Deep learning made easy

19 0.15918808 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers

20 0.15855001 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(26, 0.015), (31, 0.014), (35, 0.021), (58, 0.019), (69, 0.799), (71, 0.023), (99, 0.014)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99979597 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better than default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely previously mentioned Madelon from NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of a train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve , or AUC, which was

2 0.99751526 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision

Introduction: Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. They will evaluate the results using mean average precision, or MAP, metric. Specifically MAP@x - this means they ask us to recommend x items for each user. So what is this MAP? First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean for Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision? It may be that we don’t really need to know. But we probably need to know this: we can recommend at most x items for each user it pays to submit all x recommendations, because we are not penalized for bad guesses order matters, so it’s better to submit more certain recommendations fi

3 0.98693818 27 fast ml-2013-05-01-Deep learning made easy

Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple benchmarks for this competition and the best one is unusually hard to beat 1 - only less than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github , and has one or two very appealling properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2 . Geoff Hinton from Toronto talks about two ends

4 0.95973259 13 fast ml-2012-12-27-Spearmint with a random forest

Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees ( ntrees ) and a number of candidate features at each split ( mtry ). Here’s some code . We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in (0,10) range. We will split the data 80/10/10 into train, validation and test set, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there’s eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green : RMSE < 0.57 blue : RMSE < 0.58 black : RMSE >= 0.58 Turns out that some diffe

5 0.8511712 43 fast ml-2013-11-02-Maxing out the digits

Introduction: Recently we’ve been investigating the basics of Pylearn2 . Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. Maxout explained If you’ve been following developments in deep learning, you know that Hinton’s most recent recommendation for supervised learning, after a few years of bashing backpropagation in favour of unsupervised pretraining, is to use classic multilayer perceptrons with dropout and rectified linear units. For us, this breath of simplicity is a welcome change. Rectified linear is f(x) = max( 0, x ) . This makes backpropagation trivial: for x > 0, the derivative is one, else zero. Note that ReLU consists of two linear functions. But why stop at two? Let’s take max. out of three, or four, or five linear functions… And so maxout is a generalization of ReLU. It can approximate any convex function. Now backpropagation is easy and dropout prevents overfitting, so we can train a deep

6 0.84612674 17 fast ml-2013-01-14-Feature selection in practice

7 0.83519465 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

8 0.82228154 18 fast ml-2013-01-17-A very fast denoising autoencoder

9 0.80385792 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

10 0.79298073 35 fast ml-2013-08-12-Accelerometer Biometric Competition

11 0.76438802 20 fast ml-2013-02-18-Predicting advertised salaries

12 0.7489996 9 fast ml-2012-10-25-So you want to work for Facebook

13 0.73700678 40 fast ml-2013-10-06-Pylearn2 in practice

14 0.73375845 19 fast ml-2013-02-07-The secret of the big guys

15 0.69726211 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

16 0.69623518 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

17 0.68174088 8 fast ml-2012-10-15-Merck challenge

18 0.67955154 61 fast ml-2014-05-08-Impute missing values with Amelia

19 0.67677426 26 fast ml-2013-04-17-Regression as classification

20 0.67459387 25 fast ml-2013-04-10-Gender discrimination