fast_ml fast_ml-2014 fast_ml-2014-60 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros and a single 1. This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers. UPDATE: It turns out that Pandas has a get_dummies() function which does what we’re after. More on this in a while. We’ll use Pandas to load the data, do some cleaning and send it to Scikit-learn’s DictVectorizer.
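The get_dummies() route mentioned in the UPDATE can be sketched as follows; the frame and column name here are made up for illustration:

```python
import pandas as pd

# A tiny, made-up frame with one categorical column.
df = pd.DataFrame( { 'make': [ 'BMW', 'Mercedes', 'BMW', 'Audi' ] } )

# get_dummies() expands each listed column into one indicator
# column per distinct value - that is, one-hot encoding.
dummies = pd.get_dummies( df, columns = [ 'make' ] )

print( dummies.columns.tolist() )
# ['make_Audi', 'make_BMW', 'make_Mercedes']
```

Each row has exactly one 1 among the three new columns.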
The difference is as follows: OneHotEncoder takes as input categorical values encoded as integers - you can get them from LabelEncoder - while DictVectorizer works directly on feature dicts with string values.

The representation above is redundant, because to encode three values you need only two indicator columns. In general, one needs d - 1 columns for d values. It won’t result in information loss: in the redundant scheme with d columns, one of the indicators must be non-zero, so if two out of three are zeros, then the third must be 1. And if one among the two is positive, then the third must be zero.
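To make the redundancy argument concrete, here is a small sketch using pandas (the series values are made up, and the drop_first parameter assumes a reasonably recent pandas version):

```python
import pandas as pd

s = pd.Series( [ 'BMW', 'Mercedes', 'Audi' ] )

full = pd.get_dummies( s )                        # d = 3 indicator columns
compact = pd.get_dummies( s, drop_first = True )  # d - 1 = 2 columns

# The dropped value ('Audi' here, first in sorted order) is still
# recoverable: it is the row where both remaining indicators are zero.
print( full.shape[1], compact.shape[1] )
# 3 2
```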
Pandas

Before: The question is how to convert some columns from a data frame to a list of dicts. First we create a new data frame containing only these columns. Therefore we transpose the data frame and then call .values(). If you have only a few categorical columns, you can list them as above. In the Analytics Edge competition there are about 100 categorical columns, so in this case it’s easier to drop the columns which are not categorical:

cols_to_drop = [ 'UserID', 'YOB', 'votes', 'Happy' ]
cat_df = df.drop( cols_to_drop, axis = 1 )
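The transpose-then-dict step above can be sketched like this; cat_df and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical frame holding only the categorical columns.
cat_df = pd.DataFrame( { 'make':  [ 'BMW', 'Audi' ],
                         'color': [ 'red', 'blue' ] } )

# Transposing first makes to_dict() key the outer dict by row index,
# so its values are one dict per row - the shape DictVectorizer expects.
cat_dicts = list( cat_df.T.to_dict().values() )

print( cat_dicts )
# [{'make': 'BMW', 'color': 'red'}, {'make': 'Audi', 'color': 'blue'}]
```

For what it’s worth, cat_df.to_dict( orient = 'records' ) produces the same list in one call.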
After: Using the vectorizer:

from sklearn.feature_extraction import DictVectorizer as DV
vectorizer = DV( sparse = False )
vec_x_cat_train = vectorizer.fit_transform( x_cat_train )

If there are missing values, fill them in first with fillna( 'NA' ). This way, the vectorizer will create an additional column=NA for each feature with NAs.

This representation scored 0.768 in terms of AUC, while the alternative representation yielded 0.[...]

[...]py, headers from the source file will end up in one of the output files, probably in train.
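Putting the pieces together, a minimal end-to-end sketch (the column and values are made up; the fitted feature names live in DictVectorizer’s feature_names_ attribute):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up training frame with one categorical column and a missing value.
x_cat_train = pd.DataFrame( { 'make': [ 'BMW', 'Mercedes', np.nan ] } )
x_cat_train = x_cat_train.fillna( 'NA' )   # NaN becomes its own category

# One dict per row, as shown earlier.
train_dicts = list( x_cat_train.T.to_dict().values() )

vectorizer = DictVectorizer( sparse = False )
vec_x_cat_train = vectorizer.fit_transform( train_dicts )

print( vectorizer.feature_names_ )
# ['make=BMW', 'make=Mercedes', 'make=NA']
```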
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
2 0.12307119 20 fast ml-2013-02-18-Predicting advertised salaries
Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only 100 most common words as features, and we will use all. This approach is similiar to the one we applied in Merck challenge . More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. An error metric is mean absolute error (MAE) - how refreshing to see so intuitive one. The data for Job salary prediction con
3 0.1140347 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
Introduction: This time we enter the Stack Overflow challenge , which is about predicting a status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently relased a new version of Vowpal Wabbit , and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.
4 0.1050252 32 fast ml-2013-07-05-Processing large files, line by line
Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, libsvm format is popular. Both can be processed using csv module in Python. import csv i_f = open( input_file, 'r' ) reader = csv.reader( i_f ) For libsvm you just set the delimiter to space: reader = csv.reader( i_f, delimiter = ' ' ) Then you go over the file contents. Each line is a list of strings: for line in reader: # do something with the line, for example: label = float( line[0] ) # .... writer.writerow( line ) If you need to do a second pass, you just rewind the input file: i_f.seek( 0 ) for line in re
5 0.10272651 33 fast ml-2013-07-09-Introducing phraug
Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with few other tasks common in machine learning. With phraug you currently can convert from one format to another: csv to libsvm csv to Vowpal Wabbit libsvm to csv libsvm to Vowpal Wabbit tsv to csv And perform some other file operations: count lines in a file sample lines from a file split a file into two randomly split a file into a number of similiarly sized chunks save a continuous subset of lines from a file (for example, first 100) delete specified columns from a csv file normalize (shift and scale) columns in a csv file Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily ach
6 0.084382862 61 fast ml-2014-05-08-Impute missing values with Amelia
7 0.083719291 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
8 0.081540711 30 fast ml-2013-06-01-Amazon aspires to automate access control
9 0.070613503 25 fast ml-2013-04-10-Gender discrimination
10 0.067736447 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
11 0.06466151 39 fast ml-2013-09-19-What you wanted to know about AUC
12 0.064102456 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
13 0.062834561 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
14 0.06110971 19 fast ml-2013-02-07-The secret of the big guys
15 0.060214791 10 fast ml-2012-11-17-The Facebook challenge HOWTO
16 0.058699191 17 fast ml-2013-01-14-Feature selection in practice
17 0.055926941 40 fast ml-2013-10-06-Pylearn2 in practice
18 0.055715203 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
19 0.054059297 36 fast ml-2013-08-23-A bag of words and a nice little network
20 0.052906036 50 fast ml-2014-01-20-How to get predictions from Pylearn2