fast_ml fast_ml-2012 fast_ml-2012-12 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human): data preparation and model tuning. This story is about model tuning. Typically, to achieve satisfactory results, we first need to convert raw data into a format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, hyperparams to tune for a random forest may be the number of trees to grow and the number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid
sentIndex sentText sentNum sentScore
1 -p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1) [sent-10, score-0.491]
2 -m cachesize : set cache memory size in MB (default 100) -e epsilon : set tolerance of termination criterion (default 0.001) [sent-11, score-0.38]
3 There are two pieces to this puzzle: a config file and a wrapper file. [sent-25, score-0.519]
4 In the config file, you define the hyperparams you would like to optimize and provide some information about the wrapper file, namely its name and language (minimal sketches of both files follow this sentence list). [sent-26, score-0.509]
5 The wrapper file works as a black box: Spearmint passes hyperparam values to it in a Python dictionary and expects to get back a single number, a measure of how good those hyperparams are. [sent-30, score-0.465]
6 An important note: the format of the params dictionary is not name: value but name: [value] - each key points to a list of values. [sent-31, score-0.371]
7 That’s because you can have as many actual params as you want under one name (that’s the size variable in the config), and they are passed as a list. [sent-32, score-0.267]
8 No big deal here: the script receives a learning rate value as a command-line argument and then calls the proper function with that parameter. [sent-40, score-0.427]
9 The output will contain a validation error value, which we would like to be as low as possible. [sent-42, score-0.257]
10 Notice that the error curve has a clearly convex shape and the software gets to the minimum pretty quickly. [sent-50, score-0.349]
11 The chart above refers to optimizing the log of the learning rate. [sent-67, score-0.313]
12 First we tried optimizing the learning rate without taking a logarithm. [sent-68, score-0.38]
13 Instead of exploring the space to the left, where the error is clearly lower, Spearmint focuses for some reason on 0. [sent-70, score-0.324]
14 Only after switching to optimizing the log rate did everything go right (see the log-rate sketch after this list). [sent-74, score-0.465]
15 That poses a potential problem, because when optimizing multiple parameters you won’t be able to plot the error surface to validate it visually. [sent-75, score-0.315]
16 If you’re wondering, an apparent reason for this strange behaviour was a high error at the lower bound. [sent-76, score-0.475]
17 A learning rate lies in the range (0, 1), so we set the lower bound at 0. [sent-77, score-0.589]
18 Spearmint tries the extremes first, and at the lower bound the error is high. [sent-79, score-0.44]
19 One hypothesis is that Spearmint is unwilling to explore the space to the left because the error at the bound is so high. [sent-81, score-0.39]
20 However, in another experiment this wasn’t a problem. An apparent difference is that the error is higher at the upper bound than at the lower bound; the shape of the error curve also differs, as it is not linear here. [sent-82, score-0.756]
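To make sentences 3, 4 and 7 above more concrete, here is a minimal sketch of what such a config might look like, assuming the protobuf-style config.pb text format used by the original Spearmint; the variable name lr and its bounds are illustrative, not taken from the post.

    # config.pb -- illustrative sketch; "wrapper" refers to the wrapper.py below
    language: PYTHON
    name: "wrapper"
    variable {
        name: "lr"
        type: FLOAT
        size: 1
        min:  0.001
        max:  0.999
    }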
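Sentences 5 to 9 describe the wrapper as a black box that receives a params dictionary and returns a validation error. A minimal sketch follows, assuming the main(job_id, params) calling convention of the original Spearmint; train.py is a hypothetical stand-in for the actual training script, assumed to print the validation error as the last token of its output.

    # wrapper.py -- a minimal sketch, not the post's actual code
    import subprocess

    def main(job_id, params):
        # Each key maps to a list of values (name: [value]), even when size is 1.
        lr = params['lr'][0]
        # Pass the learning rate to a (hypothetical) training script as a
        # command-line argument and capture its output.
        out = subprocess.check_output(['python', 'train.py', str(lr)])
        # Assume the script prints the validation error as the last token.
        error = float(out.decode().strip().split()[-1])
        # Spearmint minimizes the number returned here.
        return error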
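Sentences 11 to 14 say that the search behaved better over the log of the learning rate. One way to do this, sketched below under the assumption that the config variable is a FLOAT named log_lr defined over, say, -5 to 0, is to convert back inside the wrapper:

    # Illustrative only: log_lr is an assumed variable name, and
    # train_and_validate stands in for whatever actually runs the model.
    def main(job_id, params):
        log_lr = params['log_lr'][0]      # e.g. a FLOAT in [-5, 0]
        lr = 10 ** log_lr                 # actual learning rate in [1e-5, 1]
        return train_and_validate(lr)     # should return the validation error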
wordName wordTfidf (topN-words)
[('spearmint', 0.337), ('default', 0.26), ('rate', 0.213), ('config', 0.167), ('gamma', 0.167), ('optimizing', 0.167), ('bound', 0.167), ('kernel', 0.159), ('epsilon', 0.151), ('error', 0.148), ('curve', 0.134), ('wrapper', 0.134), ('lower', 0.125), ('name', 0.114), ('degree', 0.111), ('value', 0.109), ('file', 0.109), ('function', 0.105), ('apparent', 0.101), ('exploring', 0.101), ('nu', 0.101), ('shrinking', 0.101), ('strange', 0.101), ('tail', 0.101), ('hyperparams', 0.094), ('params', 0.092), ('log', 0.085), ('tune', 0.085), ('charts', 0.084), ('octave', 0.084), ('promise', 0.084), ('protocol', 0.084), ('set', 0.084), ('options', 0.08), ('space', 0.075), ('svm', 0.075), ('beginning', 0.074), ('human', 0.074), ('practice', 0.071), ('weight', 0.067), ('pass', 0.067), ('shape', 0.067), ('parameter', 0.067), ('matlab', 0.063), ('type', 0.061), ('call', 0.061), ('dictionary', 0.061), ('size', 0.061), ('mode', 0.061), ('chart', 0.061)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999976 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
2 0.28218105 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better-than-default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely the previously mentioned Madelon from the NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative; the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate the final results of parameter tuning. As an error measure we use Area Under Curve, or AUC, which was
3 0.24709208 13 fast ml-2012-12-27-Spearmint with a random forest
Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees ( ntrees ) and a number of candidate features at each split ( mtry ). Here’s some code . We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in (0,10) range. We will split the data 80/10/10 into train, validation and test set, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there’s eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green : RMSE < 0.57 blue : RMSE < 0.58 black : RMSE >= 0.58 Turns out that some diffe
4 0.18929356 19 fast ml-2013-02-07-The secret of the big guys
Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning ( PDF ). The paper, by Coates, Lee and Ng, is about object recogni
5 0.1644346 18 fast ml-2013-01-17-A very fast denoising autoencoder
Introduction: Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called marginalized Stacked Denoising Autoencoder and the author claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders, but is orders of magnitude faster. We like all things fast, so we were hooked. About autoencoders Wikipedia says that an autoencoder is an artificial neural network and its aim is to learn a compressed representation for a set of data. This means it is used for dimensionality reduction. In other words, an autoencoder is a neural network meant to replicate the input. It would be trivial with a big enough number of units in a hidden layer: the network would just find an identity mapping. Hence dimensionality reduction: a hidden layer is typically smaller than the input layer. mSDA is a curious specimen: it is not a neural network and it doesn’t reduce dimension
6 0.13417998 20 fast ml-2013-02-18-Predicting advertised salaries
7 0.12691852 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
8 0.11722906 40 fast ml-2013-10-06-Pylearn2 in practice
9 0.11039407 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
10 0.10968812 43 fast ml-2013-11-02-Maxing out the digits
11 0.10634075 32 fast ml-2013-07-05-Processing large files, line by line
12 0.10481053 27 fast ml-2013-05-01-Deep learning made easy
13 0.10389057 25 fast ml-2013-04-10-Gender discrimination
14 0.10107192 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
15 0.099706687 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
16 0.089732103 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
17 0.089415044 33 fast ml-2013-07-09-Introducing phraug
18 0.086288206 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
19 0.085365981 17 fast ml-2013-01-14-Feature selection in practice
20 0.081562281 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
topicId topicWeight
[(0, 0.407), (1, 0.142), (2, -0.326), (3, -0.037), (4, -0.001), (5, 0.018), (6, 0.027), (7, 0.139), (8, 0.048), (9, -0.042), (10, 0.037), (11, 0.064), (12, 0.145), (13, -0.11), (14, 0.078), (15, -0.04), (16, 0.113), (17, 0.037), (18, 0.001), (19, -0.139), (20, -0.173), (21, -0.04), (22, 0.128), (23, -0.098), (24, -0.031), (25, 0.161), (26, 0.009), (27, 0.005), (28, -0.06), (29, 0.057), (30, 0.029), (31, -0.07), (32, 0.04), (33, -0.101), (34, 0.028), (35, -0.084), (36, -0.123), (37, -0.047), (38, 0.117), (39, -0.002), (40, -0.045), (41, 0.083), (42, -0.041), (43, 0.015), (44, 0.005), (45, -0.187), (46, 0.158), (47, 0.181), (48, -0.0), (49, -0.152)]
simIndex simValue blogId blogTitle
same-blog 1 0.97255743 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
2 0.52468145 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
3 0.5213806 19 fast ml-2013-02-07-The secret of the big guys
4 0.5072884 13 fast ml-2012-12-27-Spearmint with a random forest
5 0.38115337 18 fast ml-2013-01-17-A very fast denoising autoencoder
6 0.37145877 40 fast ml-2013-10-06-Pylearn2 in practice
7 0.33279407 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
8 0.29545304 20 fast ml-2013-02-18-Predicting advertised salaries
9 0.29061034 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
10 0.2723482 25 fast ml-2013-04-10-Gender discrimination
11 0.26561859 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
12 0.2575776 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
13 0.25595164 32 fast ml-2013-07-05-Processing large files, line by line
14 0.23437673 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
15 0.22051279 27 fast ml-2013-05-01-Deep learning made easy
16 0.21732683 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
17 0.2131782 36 fast ml-2013-08-23-A bag of words and a nice little network
18 0.20676531 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
19 0.19722304 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
20 0.196398 33 fast ml-2013-07-09-Introducing phraug
topicId topicWeight
[(6, 0.025), (26, 0.055), (31, 0.052), (35, 0.036), (41, 0.018), (55, 0.036), (69, 0.229), (71, 0.038), (78, 0.031), (79, 0.332), (84, 0.015), (99, 0.043)]
simIndex simValue blogId blogTitle
1 0.98004723 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
Introduction: Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are: a neural network with a single hidden layer automatic creation of polynomial, specifically quadratic and cubic, features N-grams We describe how to use them, providing examples from the Kaggle Amazon competition and for the kin8nm dataset. Neural network The original motivation for creating neural network code in VW was to win some Kaggle competitions using only vee-dub , and that goal becomes much more feasible once you have a strong non-linear learner. The network seems to be a classic multi-layer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout. Here’s an example of how to create a network with 10 hidden units: vw -d data.vw --nn 10 Quadratic and cubic features The idea of quadratic features is to create all possible combinations between original features, so that
same-blog 2 0.90475911 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
3 0.60403645 17 fast ml-2013-01-14-Feature selection in practice
Introduction: Lately we’ve been working with the Madelon dataset. It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. Madelon has 500 attributes, 20 of which are real, the rest being noise. Hence the ideal scenario would be to select just those 20 features. Fortunately we know just the right software for this task. It’s called mRMR , for minimum Redundancy Maximum Relevance , and is available in C and Matlab versions for various platforms. mRMR expects a CSV file with labels in the first column and feature names in the first row. So the game plan is: combine training and validation sets into a format expected by mRMR run selection filter the original datasets, discarding all features but the selected ones evaluate the results on the validation set if all goes well, prepare and submit files for the competition We’ll use R scripts for all the steps but feature selection. Now a few words about mRMR. It will show you p
4 0.59979731 18 fast ml-2013-01-17-A very fast denoising autoencoder
5 0.58630526 13 fast ml-2012-12-27-Spearmint with a random forest
6 0.58142638 27 fast ml-2013-05-01-Deep learning made easy
7 0.56537086 43 fast ml-2013-11-02-Maxing out the digits
8 0.56252605 36 fast ml-2013-08-23-A bag of words and a nice little network
9 0.55862927 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
10 0.55756313 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
11 0.55025846 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
12 0.54807109 61 fast ml-2014-05-08-Impute missing values with Amelia
13 0.54689372 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
14 0.5443508 20 fast ml-2013-02-18-Predicting advertised salaries
15 0.54049736 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
16 0.53653717 19 fast ml-2013-02-07-The secret of the big guys
17 0.53512871 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
18 0.53495073 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
19 0.52712858 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
20 0.52520835 26 fast ml-2013-04-17-Regression as classification