fast_ml fast_ml-2014 fast_ml-2014-60 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros and a single 1. This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers. UPDATE: It turns out that Pandas has a get_dummies() function which does what we’re after. More on this in a while. We’ll use Pandas to load the data, do some cleaning and send it to Scikit-learn’s DictVectorizer.
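The get_dummies() route mentioned in the UPDATE can be sketched as follows; the frame and column name here are made up for illustration:

```python
import pandas as pd

# A tiny, made-up frame with one categorical column.
df = pd.DataFrame( { 'make': [ 'BMW', 'Mercedes', 'BMW', 'Audi' ] } )

# get_dummies() expands each listed column into one indicator
# column per distinct value - that is, one-hot encoding.
dummies = pd.get_dummies( df, columns = [ 'make' ] )

print( dummies.columns.tolist() )
# ['make_Audi', 'make_BMW', 'make_Mercedes']
```

Each row has exactly one 1 among the three new columns.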
The difference is as follows: OneHotEncoder takes as input categorical values encoded as integers - you can get them from LabelEncoder - while DictVectorizer works directly on feature dicts with string values.

The representation above is redundant, because to encode three values you need only two indicator columns. In general, one needs d - 1 columns for d values. It won’t result in information loss: in the redundant scheme with d columns, one of the indicators must be non-zero, so if two out of three are zeros, then the third must be 1. And if one among the two is positive, then the third must be zero.
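To make the redundancy argument concrete, here is a small sketch using pandas (the series values are made up, and the drop_first parameter assumes a reasonably recent pandas version):

```python
import pandas as pd

s = pd.Series( [ 'BMW', 'Mercedes', 'Audi' ] )

full = pd.get_dummies( s )                        # d = 3 indicator columns
compact = pd.get_dummies( s, drop_first = True )  # d - 1 = 2 columns

# The dropped value ('Audi' here, first in sorted order) is still
# recoverable: it is the row where both remaining indicators are zero.
print( full.shape[1], compact.shape[1] )
# 3 2
```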
Pandas

Before: The question is how to convert some columns from a data frame to a list of dicts. First we create a new data frame containing only these columns. Therefore we transpose the data frame and then call .values(). If you have only a few categorical columns, you can list them as above. In the Analytics Edge competition there are about 100 categorical columns, so in this case it’s easier to drop the columns which are not categorical:

cols_to_drop = [ 'UserID', 'YOB', 'votes', 'Happy' ]
cat_df = df.drop( cols_to_drop, axis = 1 )
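The transpose-then-dict step above can be sketched like this; cat_df and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical frame holding only the categorical columns.
cat_df = pd.DataFrame( { 'make':  [ 'BMW', 'Audi' ],
                         'color': [ 'red', 'blue' ] } )

# Transposing first makes to_dict() key the outer dict by row index,
# so its values are one dict per row - the shape DictVectorizer expects.
cat_dicts = list( cat_df.T.to_dict().values() )

print( cat_dicts )
# [{'make': 'BMW', 'color': 'red'}, {'make': 'Audi', 'color': 'blue'}]
```

For what it’s worth, cat_df.to_dict( orient = 'records' ) produces the same list in one call.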
After: Using the vectorizer:

from sklearn.feature_extraction import DictVectorizer as DV
vectorizer = DV( sparse = False )
vec_x_cat_train = vectorizer.fit_transform( x_cat_train )

If there are missing values, fill them in first with fillna( 'NA' ). This way, the vectorizer will create an additional column=NA for each feature with NAs.

This representation scored 0.768 in terms of AUC, while the alternative representation yielded 0.[...]

[...]py, headers from the source file will end up in one of the output files, probably in train.
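Putting the pieces together, a minimal end-to-end sketch (the column and values are made up; the fitted feature names live in DictVectorizer’s feature_names_ attribute):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up training frame with one categorical column and a missing value.
x_cat_train = pd.DataFrame( { 'make': [ 'BMW', 'Mercedes', np.nan ] } )
x_cat_train = x_cat_train.fillna( 'NA' )   # NaN becomes its own category

# One dict per row, as shown earlier.
train_dicts = list( x_cat_train.T.to_dict().values() )

vectorizer = DictVectorizer( sparse = False )
vec_x_cat_train = vectorizer.fit_transform( train_dicts )

print( vectorizer.feature_names_ )
# ['make=BMW', 'make=Mercedes', 'make=NA']
```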
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
2 0.12307119 20 fast ml-2013-02-18-Predicting advertised salaries
Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only 100 most common words as features, and we will use all. This approach is similiar to the one we applied in Merck challenge . More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. An error metric is mean absolute error (MAE) - how refreshing to see so intuitive one. The data for Job salary prediction con
3 0.1140347 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
Introduction: This time we enter the Stack Overflow challenge , which is about predicting a status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently relased a new version of Vowpal Wabbit , and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.
4 0.1050252 32 fast ml-2013-07-05-Processing large files, line by line
Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, libsvm format is popular. Both can be processed using csv module in Python. import csv i_f = open( input_file, 'r' ) reader = csv.reader( i_f ) For libsvm you just set the delimiter to space: reader = csv.reader( i_f, delimiter = ' ' ) Then you go over the file contents. Each line is a list of strings: for line in reader: # do something with the line, for example: label = float( line[0] ) # .... writer.writerow( line ) If you need to do a second pass, you just rewind the input file: i_f.seek( 0 ) for line in re
5 0.10272651 33 fast ml-2013-07-09-Introducing phraug
Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with few other tasks common in machine learning. With phraug you currently can convert from one format to another: csv to libsvm csv to Vowpal Wabbit libsvm to csv libsvm to Vowpal Wabbit tsv to csv And perform some other file operations: count lines in a file sample lines from a file split a file into two randomly split a file into a number of similiarly sized chunks save a continuous subset of lines from a file (for example, first 100) delete specified columns from a csv file normalize (shift and scale) columns in a csv file Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily ach
6 0.084382862 61 fast ml-2014-05-08-Impute missing values with Amelia
7 0.083719291 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
8 0.081540711 30 fast ml-2013-06-01-Amazon aspires to automate access control
9 0.070613503 25 fast ml-2013-04-10-Gender discrimination
10 0.067736447 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
11 0.06466151 39 fast ml-2013-09-19-What you wanted to know about AUC
12 0.064102456 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
13 0.062834561 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
14 0.06110971 19 fast ml-2013-02-07-The secret of the big guys
15 0.060214791 10 fast ml-2012-11-17-The Facebook challenge HOWTO
16 0.058699191 17 fast ml-2013-01-14-Feature selection in practice
17 0.055926941 40 fast ml-2013-10-06-Pylearn2 in practice
18 0.055715203 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
19 0.054059297 36 fast ml-2013-08-23-A bag of words and a nice little network
20 0.052906036 50 fast ml-2014-01-20-How to get predictions from Pylearn2