fast_ml fast_ml-2013 fast_ml-2013-18 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called the marginalized Stacked Denoising Autoencoder, and the author claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders but is orders of magnitude faster. We like all things fast, so we were hooked. About autoencoders: Wikipedia says that an autoencoder is an artificial neural network whose aim is to learn a compressed representation for a set of data. This means it is used for dimensionality reduction. In other words, an autoencoder is a neural network meant to replicate the input. That would be trivial with a big enough number of units in the hidden layer: the network would just find an identity mapping. Hence the dimensionality reduction: the hidden layer is typically smaller than the input layer. mSDA is a curious specimen: it is not a neural network and it doesn't reduce dimensionality.
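To make the "replicate the input" idea concrete, here is a toy sketch (our own illustration in Python, not code or data from the post): a small network trained to reproduce its own input through a narrower hidden layer.

# Toy autoencoder: the target is the input itself, and the hidden layer is
# smaller than the input, so the network has to learn a compressed
# representation rather than an identity mapping.
# The data and layer sizes below are made up for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(1000, 20)                         # 1000 examples, 20 features

autoencoder = MLPRegressor(hidden_layer_sizes=(5,),  # 20 -> 5 -> 20 bottleneck
                           max_iter=2000)
autoencoder.fit(X, X)                                # replicate the input

reconstruction = autoencoder.predict(X)
print("reconstruction MSE:", ((X - reconstruction) ** 2).mean())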
sentIndex sentText sentNum sentScore
1 About autoencoders: Wikipedia says that an autoencoder is an artificial neural network and its aim is to learn a compressed representation for a set of data. [sent-5, score-0.248]
2 mDA takes a matrix of observations, makes it noisy and finds optimal weights for a linear transformation to reconstruct the original values. [sent-14, score-0.394]
3 The main trick of mSDA is marginalizing noise - it means that noise is never actually introduced to the data. [sent-20, score-1.211]
4 Instead, by marginalizing, the algorithm is effectively using infinitely many copies of noisy data to compute the denoising transformation [Chen]. [sent-21, score-0.325]
5 We will run Spearmint to optimize two mSDA parameters: a number of stacked layers and a noise level. [sent-59, score-1.142]
6 For now, the noise level will be the same for each layer. [sent-60, score-0.615]
7 If it works, we might check if denoising the sets separately makes any sense. [sent-73, score-0.24]
8 The experiments: For starters, we will try 1-10 layers (the original paper used five) and noise in 0. [sent-74, score-0.966]
9 It looks like the optimal noise level is inversely correlated with the number of layers: the more layers, the less noise needed. [sent-85, score-1.333]
10 We will use ten layers and optimize noise separately for each layer - so that there are 10 hyperparams to tune now. [sent-90, score-1.423]
11 To summarize: first layer: low noise; layers 1-5: high noise; layers 6-10: medium noise. However, in 69 tries we didn't exceed the best results from the constant scenario. [sent-92, score-2.571]
12 It may be that we need more tries to optimize ten hyperparams, but for now it seems that varying noise isn’t going to give us any mega-improvements, so we’ll stick with the simpler constant noise model. [sent-93, score-1.642]
13 Let's see how it goes with more layers: From the second run we conclude that there are several good settings for layers and noise, provided that there are at least 10 layers. [sent-94, score-0.409]
14 What’s important is to consider the two hyperparams together, because optimal noise for 10 layers will differ from optimal noise for 14 layers. [sent-97, score-1.74]
15 0.142 for a random forest trained on original data, and 0. [sent-100, score-0.226]
16 UPDATE: However, as Andy points out in the comments, better results can be achieved by feeding all layers to a random forest. [sent-107, score-0.475]
17 That is, not only the original and final denoised features, but the intermediate layers as well. [sent-108, score-0.53]
18 The change in mSDA.m: x2 = allhx'; % x2 = x2(:, start_i:end_i); % <--- this one. Of course, the dimensionality goes up: ten times with ten layers. [sent-115, score-0.302]
19 Most of the optimization time is spent training random forest models. [sent-119, score-0.231]
20 The conclusion is that if we want to use a random forest for predicting, we need to optimize mSDA hyperparams for a random forest. [sent-125, score-0.443]
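The sentences above compress the whole pipeline, so here are a few rough sketches of the individual pieces, in the order they appear. They are our own illustrations in Python (numpy / scikit-learn), written from the description in the post and in the Chen et al. paper; the original package is MATLAB, and every name, parameter value and bit of data loading below is an assumption for illustration, not the post's code.

First, the marginalized denoising step from sentences 2-4: take a matrix of observations, pretend each feature gets zeroed out with probability p, and solve for the linear reconstruction weights in closed form, so the noise is marginalized out rather than ever added to the data.

import numpy as np

def mda(X, p):
    # One marginalized denoising layer (a sketch of the step described above).
    # X is a (d, n) matrix with one column per example; p is the noise level,
    # i.e. the probability of a feature being zeroed out.
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])       # append a constant (bias) row
    q = np.append(np.full(d, 1.0 - p), 1.0)    # survival probabilities; bias never corrupted

    S = Xb @ Xb.T                              # scatter matrix of the clean data
    Q = S * np.outer(q, q)                     # E[x_tilde x_tilde^T]: off-diagonal needs both features to survive
    np.fill_diagonal(Q, q * np.diag(S))        # a feature always co-occurs with itself: q_i, not q_i^2
    P = S[:d, :] * q                           # E[x x_tilde^T]: column j scaled by survival probability q_j

    # W = P Q^{-1} is the least-squares reconstruction mapping one would get
    # from infinitely many corrupted copies of X - the marginalization trick.
    W = np.linalg.solve(Q + 1e-5 * np.eye(d + 1), P.T).T   # small ridge term for stability
    return W, np.tanh(W @ Xb)                  # mapping and squashed hidden representation

Next, sentences 5-6 (plus the conclusion in sentence 20, which argues the downstream model belongs inside the objective): a Spearmint objective over the two hyperparams, the number of stacked layers and a constant noise level. Spearmint calls a Python function with a dict of candidate values and minimizes the number it returns - the main(job_id, params) signature and array-valued params follow the usual Spearmint convention as we remember it, and the random forest settings and placeholder data are ours, included only so the sketch runs.

from sklearn.ensemble import RandomForestRegressor

# placeholder data; the post uses a real regression dataset instead
rng = np.random.default_rng(0)
x_train, x_val = rng.random((800, 8)), rng.random((200, 8))
y_train, y_val = rng.random(800), rng.random(200)

def stack_mda(X, noise, n_layers):
    # apply the mda() step n_layers times with the same noise level
    h = X
    for _ in range(n_layers):
        _, h = mda(h, noise)
    return h

def main(job_id, params):
    n_layers = int(params['layers'][0])
    noise = float(params['noise'][0])

    # denoise train and validation together, then split back
    # (sentence 7 above leaves denoising the sets separately as a follow-up)
    X = np.vstack([x_train, x_val]).T
    h = stack_mda(X, noise, n_layers).T
    h_train, h_val = h[:len(x_train)], h[len(x_train):]

    rf = RandomForestRegressor(n_estimators=100)
    rf.fit(h_train, y_train)
    return float(np.sqrt(np.mean((rf.predict(h_val) - y_val) ** 2)))   # Spearmint minimizes this

The per-layer noise experiment from sentence 10 only changes the stacking loop; with ten layers it turns one noise hyperparam into ten:

def stack_mda_varying(X, noise_levels):
    # one denoising layer per entry in noise_levels
    h = X
    for p in noise_levels:
        _, h = mda(h, p)
    return h

# constant noise, ten layers:   stack_mda_varying(X, [0.5] * 10)
# per-layer noise, ten layers:  stack_mda_varying(X, [0.2, 0.7, 0.7, 0.7, 0.7,
#                                                     0.5, 0.5, 0.5, 0.5, 0.5])
# (values are illustrative only, loosely following the low / high / medium
#  pattern summarized in sentence 11)

Finally, "feeding all layers to a random forest" from sentences 16-18 means keeping the original features plus every intermediate representation and stacking them all as one wide feature matrix, with the dimensionality increase noted in sentence 18:

def stack_mda_all_layers(X, noise, n_layers):
    # keep the original features and every layer's output
    reps = [X]
    h = X
    for _ in range(n_layers):
        _, h = mda(h, noise)
        reps.append(h)
    return np.vstack(reps)      # width grows with the number of layers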
wordName wordTfidf (topN-words)
[('noise', 0.563), ('layers', 0.337), ('msda', 0.212), ('err', 0.17), ('denoising', 0.156), ('spearmint', 0.155), ('optimize', 0.129), ('denoised', 0.127), ('layer', 0.118), ('ten', 0.113), ('noisy', 0.113), ('stacked', 0.113), ('autoencoder', 0.106), ('optimal', 0.099), ('autoencoders', 0.093), ('unlabeled', 0.093), ('forest', 0.085), ('andy', 0.085), ('constant', 0.085), ('filtered', 0.085), ('marginalizing', 0.085), ('mda', 0.085), ('varying', 0.085), ('separately', 0.084), ('hyperparams', 0.079), ('dimensionality', 0.076), ('random', 0.075), ('settings', 0.072), ('optimizing', 0.071), ('medium', 0.071), ('regression', 0.067), ('rmse', 0.067), ('original', 0.066), ('achieved', 0.063), ('robot', 0.062), ('arm', 0.062), ('linear', 0.06), ('green', 0.056), ('theory', 0.056), ('transformation', 0.056), ('looks', 0.056), ('dataset', 0.053), ('together', 0.052), ('world', 0.052), ('blue', 0.052), ('tries', 0.052), ('level', 0.052), ('care', 0.052), ('simpler', 0.052), ('network', 0.049)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 18 fast ml-2013-01-17-A very fast denoising autoencoder
2 0.1644346 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid
3 0.15769172 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better than default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely previously mentioned Madelon from NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of a train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve , or AUC, which was
4 0.15169109 13 fast ml-2012-12-27-Spearmint with a random forest
Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees ( ntrees ) and a number of candidate features at each split ( mtry ). Here’s some code . We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in (0,10) range. We will split the data 80/10/10 into train, validation and test set, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there’s eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green : RMSE < 0.57 blue : RMSE < 0.58 black : RMSE >= 0.58 Turns out that some diffe
5 0.13589999 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
Introduction: Out of 215 contestants, we placed 8th in the Cats and Dogs competition at Kaggle. The top ten finish gave us the master badge. The competition was about discerning the animals in images and here’s how we did it. We extracted the features using pre-trained deep convolutional networks, specifically decaf and OverFeat . Then we trained some classifiers on these features. The whole thing was inspired by Kyle Kastner’s decaf + pylearn2 combo and we expanded this idea. The classifiers were linear models from scikit-learn and a neural network from Pylearn2 . At the end we created a voting ensemble of the individual models. OverFeat features We touched on OverFeat in Classifying images with a pre-trained deep network . A better way to use it in this competition’s context is to extract the features from the layer before the classifier, as Pierre Sermanet suggested in the comments. Concretely, in the larger OverFeat model ( -l ) layer 24 is the softmax, at least in the
6 0.13447101 27 fast ml-2013-05-01-Deep learning made easy
7 0.097853512 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
8 0.091423213 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
9 0.090308689 19 fast ml-2013-02-07-The secret of the big guys
10 0.087724179 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
11 0.086175598 43 fast ml-2013-11-02-Maxing out the digits
12 0.085501149 20 fast ml-2013-02-18-Predicting advertised salaries
13 0.082152449 16 fast ml-2013-01-12-Intro to random forests
14 0.081546009 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
15 0.07148876 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
16 0.070475429 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics
17 0.067301735 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
18 0.06451416 53 fast ml-2014-02-20-Are stocks predictable?
19 0.063447498 17 fast ml-2013-01-14-Feature selection in practice
20 0.060931906 50 fast ml-2014-01-20-How to get predictions from Pylearn2
topicId topicWeight
[(0, 0.312), (1, 0.204), (2, -0.136), (3, 0.024), (4, 0.031), (5, -0.102), (6, 0.01), (7, -0.08), (8, 0.017), (9, -0.019), (10, 0.126), (11, -0.008), (12, 0.106), (13, -0.179), (14, -0.118), (15, -0.15), (16, 0.055), (17, -0.008), (18, 0.18), (19, 0.012), (20, 0.069), (21, 0.03), (22, 0.027), (23, -0.176), (24, 0.023), (25, 0.083), (26, -0.065), (27, 0.014), (28, -0.217), (29, 0.134), (30, 0.019), (31, 0.174), (32, 0.117), (33, -0.013), (34, 0.177), (35, -0.293), (36, 0.193), (37, 0.047), (38, -0.357), (39, 0.049), (40, -0.155), (41, 0.146), (42, 0.269), (43, -0.186), (44, 0.123), (45, -0.038), (46, 0.077), (47, -0.014), (48, -0.178), (49, 0.013)]
simIndex simValue blogId blogTitle
same-blog 1 0.97332442 18 fast ml-2013-01-17-A very fast denoising autoencoder
2 0.3321034 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
3 0.33103245 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
4 0.25862524 27 fast ml-2013-05-01-Deep learning made easy
Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple benchmarks for this competition and the best one is unusually hard to beat 1 - only less than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github, and has one or two very appealing properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2. Geoff Hinton from Toronto talks about two ends
5 0.23087616 13 fast ml-2012-12-27-Spearmint with a random forest
6 0.21541867 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
7 0.1721743 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
8 0.1709376 16 fast ml-2013-01-12-Intro to random forests
9 0.16852251 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
10 0.16734657 20 fast ml-2013-02-18-Predicting advertised salaries
11 0.16601253 43 fast ml-2013-11-02-Maxing out the digits
12 0.15579653 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
13 0.15554859 19 fast ml-2013-02-07-The secret of the big guys
14 0.153083 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
15 0.14250137 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
16 0.14011133 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
17 0.1397201 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
18 0.13606307 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
19 0.13290167 17 fast ml-2013-01-14-Feature selection in practice
20 0.13144377 25 fast ml-2013-04-10-Gender discrimination
topicId topicWeight
[(3, 0.013), (26, 0.046), (31, 0.062), (35, 0.048), (45, 0.34), (50, 0.023), (58, 0.013), (69, 0.206), (71, 0.056), (78, 0.028), (79, 0.039), (99, 0.033)]
simIndex simValue blogId blogTitle
same-blog 1 0.86465812 18 fast ml-2013-01-17-A very fast denoising autoencoder
2 0.53379583 27 fast ml-2013-05-01-Deep learning made easy
3 0.52955371 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
4 0.52934581 13 fast ml-2012-12-27-Spearmint with a random forest
5 0.50848758 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
6 0.506046 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
7 0.49591574 19 fast ml-2013-02-07-The secret of the big guys
8 0.49519584 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
9 0.49081901 17 fast ml-2013-01-14-Feature selection in practice
10 0.48840865 43 fast ml-2013-11-02-Maxing out the digits
11 0.48016232 9 fast ml-2012-10-25-So you want to work for Facebook
12 0.4796446 20 fast ml-2013-02-18-Predicting advertised salaries
13 0.47046506 40 fast ml-2013-10-06-Pylearn2 in practice
14 0.46288022 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
15 0.45790502 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
16 0.4514232 16 fast ml-2013-01-12-Intro to random forests
17 0.4476687 26 fast ml-2013-04-17-Regression as classification
18 0.4404372 61 fast ml-2014-05-08-Impute missing values with Amelia
19 0.44028583 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
20 0.43860102 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit