fast_ml fast_ml-2013 fast_ml-2013-17 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Lately we’ve been working with the Madelon dataset. It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. Madelon has 500 attributes, 20 of which are real, the rest being noise. Hence the ideal scenario would be to select just those 20 features. Fortunately we know just the right software for this task. It’s called mRMR, for minimum Redundancy Maximum Relevance, and is available in C and Matlab versions for various platforms. mRMR expects a CSV file with labels in the first column and feature names in the first row. So the game plan is: combine the training and validation sets into the format expected by mRMR; run selection; filter the original datasets, discarding all features but the selected ones; evaluate the results on the validation set; and, if all goes well, prepare and submit files for the competition. We’ll use R scripts for all the steps but feature selection. Now a few words about mRMR. It will show you possible options when run without parameters.
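A minimal R sketch of the first step of that plan: combining the training and validation sets into the CSV layout mRMR expects, with labels in the first column and feature names in the first row. The file names and the y_combined label column are illustrative assumptions, not lifted from the original scripts.

# Step 1 of the game plan: write a CSV with labels in the first column and
# feature names in the first row, which is what mRMR expects.
# File names below are illustrative; point them at your local Madelon copies.
X_train <- read.table("madelon_train.data")    # features become V1, V2, ...
y_train <- scan("madelon_train.labels")
X_val   <- read.table("madelon_valid.data")
y_val   <- scan("madelon_valid.labels")

combined <- rbind(
  data.frame(y_combined = y_train, X_train),
  data.frame(y_combined = y_val,   X_val)
)
write.csv(combined, "combined_train_val.csv", row.names = FALSE, quote = FALSE)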
sentIndex sentText sentNum sentScore
1 It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. [sent-2, score-0.558]
2 Hence the ideal scenario would be to select just those 20 features. [sent-4, score-0.356]
3 It’s called mRMR , for minimum Redundancy Maximum Relevance , and is available in C and Matlab versions for various platforms. [sent-6, score-0.126]
4 mRMR expects a CSV file with labels in the first column and feature names in the first row. [sent-7, score-0.185]
5 It will show you possible options when run without parameters. [sent-10, score-0.091]
6 The first is used to select a method, and we stick with the default. [sent-13, score-0.227]
7 The second one is a threshold for discretization. [sent-15, score-0.378]
8 It has to do with the fact that mRMR needs to discretize feature values. [sent-16, score-0.257]
9 With the threshold at zero ( -t 0 ), it will just binarize: “above the mean” and “below the mean”. [sent-17, score-0.378]
10 If you specify a threshold, there will be three brackets, marked by two points: the mean - t * standard deviation, and the mean + t * standard deviation. A threshold of one seems to be working well here. [sent-18, score-1.152] (A short R sketch of this binning follows the sentence list.)
11-15 We’ll ask to select 20 features and use 10000 samples (or all available). The command and its (partly elided) output, reassembled from the chopped fragments:

mrmr -i data\combined_train_val.[...]

[...]00*sigma #fea=20 selection method=MID #maxVar=10000 #maxSample=10000
Target classification variable (#1 column in the input data) has name=y_combined entropy score=1.000

*** MaxRel features ***
Order Fea Name Score
1 339 V339 0.[...]
[... remaining rows elided ...]

*** mRMR features ***
Order Fea Name Score
1 339 V339 0.[...]
[... remaining rows elided ...]
^C

You’ll notice that there are two sets of features: MaxRel and mRMR. [sent-19 to sent-53]
16 The first set takes only a short while to select, while the second needs time quadratic in the number of features, so with each additional feature you wait longer and longer. [sent-54, score-0.257]
17 We expect at least a few attributes with relatively higher scores, and that’s what we get from MaxRel (shown as a chart in the original post), but not from mRMR. [sent-56, score-0.126]
18 Where you cut off is a matter of some testing; we go with 13 attributes. [sent-57, score-0.072]
19 This is consistent with R indexing, so now we just copy and paste the selected indexes into an R script and proceed: mrmr_indexes = c( 339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379 ). It turns out that the process indeed improves the results: we get AUC = 0.[...] [sent-59, score-0.383] (A sketch of this filtering and evaluation step also follows the list.)
20 To obtain an even better score, we could write a few scripts and run Spearmint to optimize the threshold, the number of features used and the number of trees in the forest. [sent-63, score-0.254]
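A minimal R sketch of the discretization described in sentence 10, splitting values into three brackets at the mean minus t standard deviations and the mean plus t standard deviations. It only mimics what mRMR is described as doing internally; the function is illustrative and not part of any package.

# Illustration of mRMR's three-bracket discretization as described above.
# This mimics the described behaviour only; it is not taken from mRMR's source.
discretize_feature <- function(x, t = 1) {
  if (t == 0) {
    # -t 0: just binarize into "below the mean" and "above the mean"
    return(cut(x, breaks = c(-Inf, mean(x), Inf), labels = c("low", "high")))
  }
  lo <- mean(x) - t * sd(x)
  hi <- mean(x) + t * sd(x)
  cut(x, breaks = c(-Inf, lo, hi, Inf), labels = c("low", "mid", "high"))
}

# Example on one Madelon column, using the `combined` data frame from the earlier sketch:
# table(discretize_feature(combined$V339, t = 1))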
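And a hedged sketch of the "filter the datasets" and "evaluate on the validation set" steps, using the 13 indexes from sentence 19. The randomForest and ROCR packages, and the X_train / y_train / X_val / y_val objects from the first sketch, are assumptions; the post's actual R scripts may differ.

# Keep only the mRMR-selected columns, fit a random forest on the training set
# and compute AUC on the validation set. Package and object choices are
# assumptions, not a reproduction of the original scripts.
library(randomForest)
library(ROCR)

mrmr_indexes <- c(339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379)

train_sel <- X_train[, mrmr_indexes]
val_sel   <- X_val[, mrmr_indexes]

rf <- randomForest(x = train_sel, y = as.factor(y_train))
p  <- predict(rf, val_sel, type = "prob")[, 2]   # probability of the positive class

pred <- prediction(p, y_val)
auc  <- performance(pred, "auc")@y.values[[1]]
print(auc)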
simIndex simValue blogId blogTitle
same-blog 1 0.99999994 17 fast ml-2013-01-14-Feature selection in practice
2 0.1364352 25 fast ml-2013-04-10-Gender discrimination
Introduction: There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not? Congratulations! You have spotted the ceiling cat. As Sashi noticed on the forums , it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story. Let’s start with normalizing. There’s a nice function for that in R, scale() . The dataset is small, 1128 examples, so we can go ahead and use R. Turns out that in its raw form, the data won’t scale. That’s because apparently there are some columns with zeros only. That makes it difficult to divide. Fortunately, we know just the right tool for the task. We learned about it on Kaggle forums too. It’s a function in caret package called nearZeroVar() . It will give you indexes of all the columns which have near zero var
3 0.13120295 39 fast ml-2013-09-19-What you wanted to know about AUC
Introduction: AUC, or Area Under Curve, is a metric for binary classification. It’s probably the second most popular one, after accuracy. Unfortunately, it’s nowhere near as intuitive. That is, until you have read this article. Accuracy deals with ones and zeros, meaning you either got the class label right or you didn’t. But many classifiers are able to quantify their uncertainty about the answer by outputting a probability value. To compute accuracy from probabilities you need a threshold to decide when zero turns into one. The most natural threshold is of course 0.5. Let’s suppose you have a quirky classifier. It is able to get all the answers right, but it outputs 0.7 for negative examples and 0.9 for positive examples. Clearly, a threshold of 0.5 won’t get you far here. But 0.8 would be just perfect. That’s the whole point of using AUC - it considers all possible thresholds. Various thresholds result in different true positive/false positive rates. As you decrease the threshold, you get
4 0.1047406 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better than default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely previously mentioned Madelon from NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of a train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve , or AUC, which was
5 0.094269231 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
Introduction: The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot , used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525. We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava . To their credit, Charlie has won the two remaining Challenges in Representation Learning . Not-so-deep learning The difference to our previous, beating-the-benchmark attempt is twofold: one layer instead of two for supervised learning, VW instead of a random forest Somewhat suprisingly, one layer works better than two. Even more surprisingly, with enough units you can get 0.634 using a linear model (Vowpal Wabbit, of course, One-Against-All). In our understanding, that’s the point of overcomplete representations*, which Stanford people seem to care much about. Recall The secret of the big guys and the pape
6 0.090243548 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
7 0.085365981 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
8 0.080321185 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
9 0.080019966 33 fast ml-2013-07-09-Introducing phraug
10 0.078080341 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
11 0.076875411 43 fast ml-2013-11-02-Maxing out the digits
12 0.074313998 19 fast ml-2013-02-07-The secret of the big guys
13 0.073086999 13 fast ml-2012-12-27-Spearmint with a random forest
14 0.07282313 20 fast ml-2013-02-18-Predicting advertised salaries
15 0.071494848 35 fast ml-2013-08-12-Accelerometer Biometric Competition
16 0.066312253 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
17 0.066154458 32 fast ml-2013-07-05-Processing large files, line by line
18 0.064097531 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
19 0.063663423 10 fast ml-2012-11-17-The Facebook challenge HOWTO
20 0.063447498 18 fast ml-2013-01-17-A very fast denoising autoencoder