fast_ml fast_ml-2013 fast_ml-2013-43 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Recently we’ve been investigating the basics of Pylearn2. Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. Maxout explained: If you’ve been following developments in deep learning, you know that Hinton’s most recent recommendation for supervised learning, after a few years of bashing backpropagation in favour of unsupervised pretraining, is to use classic multilayer perceptrons with dropout and rectified linear units. For us, this breath of simplicity is a welcome change. Rectified linear is f(x) = max(0, x). This makes backpropagation trivial: for x > 0, the derivative is one, else zero. Note that ReLU consists of two linear functions. But why stop at two? Let’s take the max out of three, or four, or five linear functions… And so maxout is a generalization of ReLU. It can approximate any convex function. Now backpropagation is easy and dropout prevents overfitting, so we can train a deep network.
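To make the generalization concrete, here is a minimal NumPy sketch (not from the original post, and not Pylearn2 code) of a ReLU and of a single maxout unit taking the maximum over k affine pieces of its input; the shapes, names and k = 3 below are illustrative assumptions.

import numpy as np

def relu(x):
    # ReLU is the maximum of two linear functions: 0 and x.
    return np.maximum(0, x)

def maxout(x, W, b):
    # One maxout unit: the maximum over k affine functions of the input x.
    # x: input of shape (d,); W: weights of shape (k, d); b: biases of shape (k,).
    z = W.dot(x) + b   # the k linear pieces
    return z.max()     # maxout takes their maximum

# Tiny usage example with k = 3 random linear pieces (illustrative values only).
rng = np.random.RandomState(0)
x = rng.randn(5)
W = rng.randn(3, 5)
b = rng.randn(3)
print(relu(x))
print(maxout(x, W, b))

With two pieces, one of them fixed at zero, this reduces to ReLU; with enough pieces a maxout unit can approximate any convex function of its input, which is the sense in which maxout generalizes ReLU.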
sentIndex sentText sentNum sentScore
1 Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. [sent-2, score-0.572]
2 out of three, or four, or five linear functions… And so maxout is a generalization of ReLU. [sent-10, score-0.23]
3 Now backpropagation is easy and dropout prevents overfitting, so we can train a deep network. [sent-12, score-0.263]
4 Data: Pylearn2 provides some code for reproducing results from the maxout paper, including MNIST and CIFAR-10. [sent-14, score-0.23]
5 Additionally, we split the training set for validation and train the model on 38k examples, without further re-training on the full set, which would probably increase accuracy. [sent-20, score-0.364]
6 For more info about running a multilayer perceptron on MNIST, see the tutorial by Ian Goodfellow. [sent-21, score-0.232]
7 Simple vs. convoluted: The authors of the paper report two scores for MNIST: one for the permutation-invariant approach and another, better-scoring one, for a convolutional network. [sent-22, score-0.459]
8 However, this makes things slightly more complicated, so for now we stick with permutation invariance. [sent-26, score-0.208]
9 In this case we have a training set with headers and a validation set without them. [sent-38, score-0.46]
10 mnist import MNIST >>> train = MNIST( 'train' ) >>> train. [sent-44, score-0.484]
11 Error on the validation set goes down pretty fast with training; here’s a plot for both sets. [sent-52, score-0.27]
12 Training stops when the validation score stops improving. [sent-53, score-0.456]
13 When to stop training: We did change some hyperparams after all. To make things quicker, we initially modified the so-called termination criterion from this: termination_criterion: ! [sent-63, score-0.503]
14 001 N: 10 }, It means “Stop training if the validation error doesn’t decrease in 10 epochs from now” (see the sketch after this list). [sent-71, score-0.67]
15 The original version waits 100 epochs and is OK with zero decrease. [sent-72, score-0.295]
16 With the original settings, training runs for 192 + 100 epochs and results in valid_y_misclass: 0. [sent-74, score-0.389]
17 With 240 hidden units, training consists of 150 + 100 epochs and the validation error is 0. [sent-78, score-0.681]
18 The reason for that might be the validation set - we use 4k examples. [sent-82, score-0.27]
19 Trying to improve the score: A faster way to improve the score would be to use the 4k held-out validation examples for training. [sent-84, score-0.602]
20 There are no labels in the test set, so this time the termination criterion is different: termination_criterion: ! [sent-85, score-0.217]
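Sentences 13-15 and 20 above refer to Pylearn2’s monitor-based termination criterion. The sketch below is not from the post and not Pylearn2’s actual API; it is a plain-Python rendering of the stopping rule as described, where the parameter name prop_decrease and the exact comparison are assumptions.

def keep_training(valid_errors, N=10, prop_decrease=0.001):
    # valid_errors: validation error recorded after each epoch (lower is better).
    # N: how many epochs to wait for an improvement; the post uses 10,
    #    while the original maxout config waits 100 epochs.
    # prop_decrease: required proportional improvement (parameter name assumed);
    #    0 means any non-increase is enough, as in the original setting.
    if len(valid_errors) <= N:
        return True
    best_before = min(valid_errors[:-N])   # best error up to N epochs ago
    recent_best = min(valid_errors[-N:])   # best error within the last N epochs
    # Keep going only if the recent best beats the earlier best by the required proportion.
    return recent_best < best_before * (1.0 - prop_decrease)

# Example: the error flattens out at 0.049, so once N epochs pass without
# a sufficient decrease the criterion says stop.
errors = [0.10, 0.06, 0.05] + [0.049] * 11
print(keep_training(errors, N=10, prop_decrease=0.001))   # False

In the Pylearn2 YAML configs referenced in sentences 13 and 20, this logic sits behind the termination_criterion entry.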
wordName wordTfidf (topN-words)
[('mnist', 0.415), ('obj', 0.289), ('maxout', 0.23), ('epochs', 0.23), ('permutation', 0.208), ('validation', 0.174), ('backpropagation', 0.153), ('multilayer', 0.153), ('stop', 0.127), ('kaggle', 0.117), ('termination', 0.115), ('invariant', 0.115), ('error', 0.114), ('dropout', 0.11), ('criterion', 0.102), ('surprise', 0.102), ('digits', 0.102), ('score', 0.098), ('set', 0.096), ('training', 0.094), ('rectified', 0.092), ('stops', 0.092), ('worse', 0.092), ('improve', 0.085), ('objective', 0.085), ('basically', 0.079), ('hidden', 0.079), ('path', 0.079), ('perceptron', 0.079), ('value', 0.075), ('scores', 0.075), ('advantage', 0.073), ('consists', 0.069), ('import', 0.069), ('original', 0.065), ('ve', 0.065), ('hyperparams', 0.065), ('examples', 0.062), ('approach', 0.061), ('recommendation', 0.058), ('account', 0.058), ('matches', 0.058), ('trivial', 0.058), ('trying', 0.058), ('improved', 0.058), ('decreased', 0.058), ('shuffle', 0.058), ('decrease', 0.058), ('boolean', 0.058), ('channel', 0.058)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999964 43 fast ml-2013-11-02-Maxing out the digits
2 0.18208717 40 fast ml-2013-10-06-Pylearn2 in practice
Introduction: What do you get when you mix one part brilliant and one part daft? You get Pylearn2, a cutting-edge neural networks library from Montreal that’s rather hard to use. Here we’ll show how to get through the daft part with your mental health relatively intact. Pylearn2 comes from the Lisa Lab in Montreal, led by Yoshua Bengio. Those are pretty smart guys and they concern themselves with deep learning. Recently they published a paper entitled Pylearn2: a machine learning research library [arxiv]. Here’s a quote: Pylearn2 is a machine learning research library - its users are researchers. This means (…) it is acceptable to assume that the user has some technical sophistication and knowledge of machine learning. The word research is possibly the most common word in the paper. There’s a reason for that: the library is certainly not production-ready. OK, it’s not that bad. There are only two difficult things: getting your data in, and getting predictions out. What’
3 0.16230737 58 fast ml-2014-04-12-Deep learning these days
Introduction: It seems that quite a few people with interest in deep learning think of it in terms of unsupervised pre-training, autoencoders, stacked RBMs and deep belief networks. It’s easy to get into this groove by watching one of Geoff Hinton’s videos from a few years ago, where he bashes backpropagation in favour of unsupervised methods that are able to discover the structure in data by themselves, the same way the human brain does. Those videos, papers and tutorials linger. They were state of the art once, but things have changed since then. These days supervised learning is the king again. This has to do with the fact that you can look at data from many different angles and usually you’d prefer a representation that is useful for the discriminative task at hand. Unsupervised learning will find some angle, but will it be the one you want? In the case of the MNIST digits, sure. Otherwise probably not. Or maybe it will find a lot of angles while you only need one. Ladies and gentlemen, pleas
4 0.13839172 27 fast ml-2013-05-01-Deep learning made easy
Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple of benchmarks for this competition and the best one is unusually hard to beat 1 - fewer than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github, and has one or two very appealing properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2. Geoff Hinton from Toronto talks about two ends
5 0.13454224 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better-than-default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely the previously mentioned Madelon from the NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative; the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve, or AUC, which was
6 0.12880133 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
7 0.11600675 20 fast ml-2013-02-18-Predicting advertised salaries
8 0.11526017 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
9 0.10968812 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
10 0.10702022 26 fast ml-2013-04-17-Regression as classification
11 0.10008932 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
12 0.09978155 13 fast ml-2012-12-27-Spearmint with a random forest
13 0.09807732 25 fast ml-2013-04-10-Gender discrimination
14 0.092318006 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
15 0.089675061 19 fast ml-2013-02-07-The secret of the big guys
16 0.086175598 18 fast ml-2013-01-17-A very fast denoising autoencoder
17 0.082663491 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
18 0.082525246 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
19 0.081435159 32 fast ml-2013-07-05-Processing large files, line by line
20 0.081245072 50 fast ml-2014-01-20-How to get predictions from Pylearn2
topicId topicWeight
[(0, 0.343), (1, 0.111), (2, -0.042), (3, 0.022), (4, -0.033), (5, -0.107), (6, 0.066), (7, 0.203), (8, -0.385), (9, 0.073), (10, -0.06), (11, 0.117), (12, -0.104), (13, 0.114), (14, -0.07), (15, 0.102), (16, 0.103), (17, -0.02), (18, 0.078), (19, 0.064), (20, 0.0), (21, -0.021), (22, -0.026), (23, 0.087), (24, -0.094), (25, -0.083), (26, 0.213), (27, -0.075), (28, 0.033), (29, 0.04), (30, -0.003), (31, -0.025), (32, -0.158), (33, 0.093), (34, 0.167), (35, 0.004), (36, -0.007), (37, -0.08), (38, -0.192), (39, 0.083), (40, -0.061), (41, 0.128), (42, -0.025), (43, 0.1), (44, -0.196), (45, 0.204), (46, 0.112), (47, -0.363), (48, 0.088), (49, 0.139)]
simIndex simValue blogId blogTitle
same-blog 1 0.97352087 43 fast ml-2013-11-02-Maxing out the digits
2 0.33466992 40 fast ml-2013-10-06-Pylearn2 in practice
3 0.33091596 58 fast ml-2014-04-12-Deep learning these days
4 0.32190511 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
5 0.27513385 20 fast ml-2013-02-18-Predicting advertised salaries
Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use them all. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con
6 0.26403353 19 fast ml-2013-02-07-The secret of the big guys
7 0.26307264 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
8 0.24579445 13 fast ml-2012-12-27-Spearmint with a random forest
9 0.24122189 25 fast ml-2013-04-10-Gender discrimination
10 0.23994853 27 fast ml-2013-05-01-Deep learning made easy
11 0.23303735 26 fast ml-2013-04-17-Regression as classification
12 0.21602036 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
13 0.21453376 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
14 0.20458031 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
15 0.20222214 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
16 0.19011901 18 fast ml-2013-01-17-A very fast denoising autoencoder
17 0.17326336 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
18 0.16303517 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
19 0.16067041 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
20 0.16015047 17 fast ml-2013-01-14-Feature selection in practice
topicId topicWeight
[(24, 0.362), (26, 0.035), (31, 0.039), (35, 0.044), (43, 0.02), (51, 0.017), (55, 0.026), (58, 0.022), (69, 0.19), (71, 0.027), (78, 0.034), (79, 0.021), (97, 0.031), (99, 0.054)]
simIndex simValue blogId blogTitle
same-blog 1 0.83726424 43 fast ml-2013-11-02-Maxing out the digits
2 0.48115665 27 fast ml-2013-05-01-Deep learning made easy
3 0.46912071 13 fast ml-2012-12-27-Spearmint with a random forest
Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: the number of trees (ntrees) and the number of candidate features at each split (mtry). Here’s some code. We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in the (0,10) range. We will split the data 80/10/10 into train, validation and test sets, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there are eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote the validation error value: green: RMSE < 0.57, blue: RMSE < 0.58, black: RMSE >= 0.58. Turns out that some diffe
4 0.46410251 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
Introduction: The promise: What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human): data preparation and model tuning. This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into a format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be the number of trees to grow and the number of candidate features at each split (mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid
5 0.4603855 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
6 0.45976153 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
7 0.4595564 40 fast ml-2013-10-06-Pylearn2 in practice
8 0.45550033 19 fast ml-2013-02-07-The secret of the big guys
9 0.4516634 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
10 0.44904664 18 fast ml-2013-01-17-A very fast denoising autoencoder
11 0.44706661 9 fast ml-2012-10-25-So you want to work for Facebook
12 0.44558093 17 fast ml-2013-01-14-Feature selection in practice
13 0.4273971 58 fast ml-2014-04-12-Deep learning these days
14 0.42150897 20 fast ml-2013-02-18-Predicting advertised salaries
15 0.41695958 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
16 0.41060537 35 fast ml-2013-08-12-Accelerometer Biometric Competition
17 0.41021109 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
18 0.41013515 16 fast ml-2013-01-12-Intro to random forests
19 0.40358907 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
20 0.40037021 32 fast ml-2013-07-05-Processing large files, line by line