fast_ml fast_ml-2013 fast_ml-2013-25 knowledge-graph by maker-knowledge-mining

25 fast ml-2013-04-10-Gender discrimination


meta info for this blog

Source: html

Introduction: There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not? Congratulations! You have spotted the ceiling cat. As Sashi noticed on the forums, it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story. Let’s start with normalizing. There’s a nice function for that in R: scale(). The dataset is small, 1128 examples, so we can go ahead and use R. Turns out that in its raw form, the data won’t scale. That’s because apparently there are some columns with zeros only. That makes it difficult to divide. Fortunately, we know just the right tool for the task. We learned about it on Kaggle forums too. It’s a function in the caret package called nearZeroVar(). It will give you indexes of all the columns which have near zero variance.
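The two preprocessing steps above (dropping zero-variance columns, then centering and scaling) can be sketched in Python as well. This is our own minimal stand-in for caret’s nearZeroVar() and R’s scale(), not the post’s code; the function names and the tiny threshold are our assumptions.

```python
import numpy as np

def drop_near_zero_var(X, threshold=1e-8):
    """Drop columns whose variance is (near) zero - a rough stand-in
    for caret's nearZeroVar(); threshold is an arbitrary small cutoff."""
    variances = X.var(axis=0)
    keep = variances > threshold
    return X[:, keep], keep

def scale_columns(X):
    """Center and scale columns, like R's scale()."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # sample standard deviation, as scale() uses
    return (X - mean) / std

# toy data: the middle column is all zeros, so scaling it would divide by zero
X = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 4.0],
              [5.0, 0.0, 9.0]])
X_reduced, kept = drop_near_zero_var(X)
X_scaled = scale_columns(X_reduced)
```

Dropping the constant columns first is what makes the division in the scaling step safe.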


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 They want to be able to discriminate men from women based on handwriting. [sent-2, score-0.356]

2 As Sashi noticed on the forums , it’s not difficult to improve on the benchmarks a little bit. [sent-6, score-0.305]

3 In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. [sent-7, score-0.186]

4 It will give you indexes of all the columns which have near zero variance, so you can delete them:
nzv_i = nearZeroVar( data )
data = data[,-nzv_i]
Easy as that. [sent-18, score-0.142]

5 We’ll just represent Arabic/English as 0/1 and get rid of writer IDs before scaling. [sent-23, score-0.37]

6 Feature selection: the dataset has a thousand examples and seven thousand features. [sent-26, score-0.525]

7 We use mRMR, as described in Feature selection in practice . [sent-33, score-0.157]
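mRMR itself is a standalone C/Matlab tool; as a rough illustration of its greedy relevance-minus-redundancy idea, here is a small sketch of ours in Python, using absolute correlations as a crude stand-in for the mutual-information terms the real tool computes. The function name and scoring details are our assumptions, not the tool’s.

```python
import numpy as np

def mrmr_like_select(X, y, k):
    """Greedy minimum-Redundancy-Maximum-Relevance-style selection.

    Correlation with the label plays the role of relevance; mean absolute
    correlation with already-chosen features plays the role of redundancy.
    """
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# toy check: x0 and x1 are near-duplicates, x2 carries the other half of y,
# so a redundancy-aware pick of two features should take one of {x0, x1} plus x2
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=300), rng.normal(size=300)
y = s1 + s2
X = np.column_stack([s1, s1 + 0.01 * rng.normal(size=300), s2])
selected = mrmr_like_select(X, y, k=2)
```

A plain top-k-by-relevance filter would happily take both near-duplicate columns; penalizing redundancy is the whole point of mRMR.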

8 Training a classifier: a regularized linear model works better than a GLM. [sent-34, score-0.227]

9 Its advantage is the ability to determine lambda automatically. [sent-36, score-0.339]

10 , train )  # V1 is y, or gender
p = predict( model, test )
# normalize p to <0,1> range here,
# or just cut down values outside this range
or maybe logisticRidge, which is a bit slower:
model = logisticRidge( V1 ~ . , train ) [sent-39, score-0.425]

11 p = predict( model, test )
p = sigmoid( p )
Random forest works similarly well. [sent-40, score-0.155]
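The R snippets above fit a ridge model and a logistic ridge model; a rough scikit-learn analogue could look like the sketch below. The toy data and the fixed regularization strengths are our assumptions, and unlike the R package described above, plain Ridge does not pick lambda automatically (RidgeCV would tune alpha instead).

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

# toy stand-in for the handwriting features: label depends on the first column
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.05 * rng.normal(size=500) > 0).astype(float)  # 0/1 labels

# ridge regression on 0/1 labels, then cut values down to the <0,1> range
ridge = Ridge(alpha=1.0).fit(X, y)
p = np.clip(ridge.predict(X), 0.0, 1.0)

# L2-regularized logistic regression: the analogue of logisticRidge,
# with the sigmoid already applied inside predict_proba
logit = LogisticRegression(C=1.0).fit(X, y)
p_logit = logit.predict_proba(X)[:, 1]
```

Clipping is the blunt version of "cut down values outside this range"; the logistic variant avoids the problem entirely by producing probabilities directly.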

12 Validation There’s a sneaky issue with this data set: if you want realistic metrics from validation, you need to validate a certain way. [sent-43, score-0.293]

13 Namely, you need to split the set so that the writers which are in the train part are not in the test part. [sent-44, score-0.309]

14 Otherwise any powerful classifier like random forest will learn to discriminate on particular writer’s style. [sent-45, score-0.415]

15 This will result in an overly optimistic score, as the real test set will consist only of unknown writers. [sent-47, score-0.155]

16 We provide a Python script for randomly splitting the data taking the writer issue into account. [sent-48, score-0.642]

17 The first argument is the original training file with a writers column. [sent-55, score-0.32]

18 And the reason that these are different is that you may want to split a file with writers info already stripped. [sent-57, score-0.409]

19 The third and fourth arguments are output files and the fifth is a ratio between these files - a probability, by default 0. [sent-58, score-0.424]

20 If instead of splitting you’d like to divide a file into a few same-sized chunks, use this: chunk_by_writers. [sent-60, score-0.293]
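The post’s actual splitting script isn’t reproduced here, but the idea it describes — assign each writer wholesale to one side with some probability, so no writer straddles the train/test boundary — can be sketched as follows. The function name and argument layout are hypothetical, not the script’s.

```python
import random

def split_by_writers(rows, writers, p=0.9, seed=0):
    """Split rows so that no writer appears in both parts.

    rows    - list of data rows
    writers - writer id for each row (same length as rows)
    p       - probability that a given writer's rows go to the first part
    """
    rng = random.Random(seed)
    # decide once per writer, not once per row
    assignment = {w: (rng.random() < p) for w in set(writers)}
    first = [r for r, w in zip(rows, writers) if assignment[w]]
    second = [r for r, w in zip(rows, writers) if not assignment[w]]
    return first, second
```

Deciding per writer rather than per row is exactly what keeps a powerful classifier from silently memorizing individual handwriting styles during validation.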


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('writer', 0.37), ('writers', 0.231), ('discriminate', 0.185), ('lambda', 0.185), ('logisticridge', 0.185), ('nearzerovar', 0.185), ('weren', 0.185), ('thousand', 0.184), ('selection', 0.157), ('ratio', 0.154), ('classifier', 0.138), ('forums', 0.136), ('issue', 0.136), ('splitting', 0.136), ('ids', 0.113), ('range', 0.113), ('regularized', 0.113), ('files', 0.101), ('package', 0.098), ('difficult', 0.092), ('particular', 0.092), ('model', 0.089), ('want', 0.089), ('file', 0.089), ('based', 0.082), ('test', 0.078), ('scale', 0.077), ('determine', 0.077), ('cut', 0.077), ('ability', 0.077), ('ceiling', 0.077), ('congratulations', 0.077), ('spotted', 0.077), ('chunks', 0.077), ('held', 0.077), ('noticed', 0.077), ('similiarly', 0.077), ('unknown', 0.077), ('columns', 0.074), ('feature', 0.073), ('validation', 0.07), ('divide', 0.068), ('normalize', 0.068), ('validate', 0.068), ('mrmr', 0.068), ('variance', 0.068), ('delete', 0.068), ('ahead', 0.068), ('arbitrary', 0.068), ('arguments', 0.068)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 25 fast ml-2013-04-10-Gender discrimination


2 0.1461945 20 fast ml-2013-02-18-Predicting advertised salaries

Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use all. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con

3 0.1364352 17 fast ml-2013-01-14-Feature selection in practice

Introduction: Lately we’ve been working with the Madelon dataset. It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. Madelon has 500 attributes, 20 of which are real, the rest being noise. Hence the ideal scenario would be to select just those 20 features. Fortunately we know just the right software for this task. It’s called mRMR, for minimum Redundancy Maximum Relevance, and is available in C and Matlab versions for various platforms. mRMR expects a CSV file with labels in the first column and feature names in the first row. So the game plan is:
combine training and validation sets into a format expected by mRMR
run selection
filter the original datasets, discarding all features but the selected ones
evaluate the results on the validation set
if all goes well, prepare and submit files for the competition
We’ll use R scripts for all the steps but feature selection. Now a few words about mRMR. It will show you p

4 0.133012 32 fast ml-2013-07-05-Processing large files, line by line

Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, libsvm format is popular. Both can be processed using the csv module in Python.
import csv
i_f = open( input_file, 'r' )
reader = csv.reader( i_f )
For libsvm you just set the delimiter to space:
reader = csv.reader( i_f, delimiter = ' ' )
Then you go over the file contents. Each line is a list of strings:
for line in reader:
    # do something with the line, for example:
    label = float( line[0] )
    # ....
    writer.writerow( line )
If you need to do a second pass, you just rewind the input file:
i_f.seek( 0 )
for line in re

5 0.11113225 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting a status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

6 0.10777096 33 fast ml-2013-07-09-Introducing phraug

7 0.10389057 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

8 0.09807732 43 fast ml-2013-11-02-Maxing out the digits

9 0.091429599 19 fast ml-2013-02-07-The secret of the big guys

10 0.085884161 40 fast ml-2013-10-06-Pylearn2 in practice

11 0.083692372 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

12 0.082263641 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

13 0.074314132 42 fast ml-2013-10-28-How much data is enough?

14 0.072276875 61 fast ml-2014-05-08-Impute missing values with Amelia

15 0.070613503 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

16 0.069915958 27 fast ml-2013-05-01-Deep learning made easy

17 0.069335684 50 fast ml-2014-01-20-How to get predictions from Pylearn2

18 0.067094244 13 fast ml-2012-12-27-Spearmint with a random forest

19 0.06680014 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

20 0.061322924 39 fast ml-2013-09-19-What you wanted to know about AUC


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.301), (1, -0.132), (2, -0.088), (3, -0.055), (4, -0.052), (5, 0.036), (6, -0.135), (7, 0.172), (8, -0.075), (9, -0.056), (10, 0.018), (11, -0.169), (12, -0.169), (13, -0.161), (14, -0.048), (15, 0.01), (16, 0.159), (17, 0.082), (18, 0.13), (19, -0.12), (20, 0.145), (21, -0.097), (22, -0.004), (23, 0.13), (24, 0.031), (25, -0.29), (26, 0.093), (27, 0.084), (28, 0.016), (29, -0.051), (30, 0.039), (31, 0.045), (32, -0.154), (33, 0.101), (34, -0.116), (35, -0.109), (36, -0.341), (37, 0.196), (38, -0.069), (39, -0.214), (40, -0.145), (41, -0.124), (42, -0.098), (43, -0.233), (44, 0.109), (45, -0.289), (46, 0.047), (47, -0.043), (48, 0.063), (49, 0.103)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9619835 25 fast ml-2013-04-10-Gender discrimination


2 0.35960218 20 fast ml-2013-02-18-Predicting advertised salaries


3 0.24277633 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise. What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human):
data preparation
model tuning
This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into the format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

4 0.2368509 17 fast ml-2013-01-14-Feature selection in practice


5 0.23291393 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow


6 0.23216921 32 fast ml-2013-07-05-Processing large files, line by line

7 0.21599293 43 fast ml-2013-11-02-Maxing out the digits

8 0.19261011 19 fast ml-2013-02-07-The secret of the big guys

9 0.17532733 61 fast ml-2014-05-08-Impute missing values with Amelia

10 0.17110561 27 fast ml-2013-05-01-Deep learning made easy

11 0.16864641 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

12 0.1671581 40 fast ml-2013-10-06-Pylearn2 in practice

13 0.15985937 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

14 0.15937267 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

15 0.15660562 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers

16 0.14622393 30 fast ml-2013-06-01-Amazon aspires to automate access control

17 0.14571519 50 fast ml-2014-01-20-How to get predictions from Pylearn2

18 0.14548014 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

19 0.14486431 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

20 0.14274617 42 fast ml-2013-10-28-How much data is enough?


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(6, 0.024), (26, 0.063), (31, 0.045), (35, 0.018), (37, 0.412), (55, 0.079), (58, 0.013), (69, 0.141), (71, 0.031), (81, 0.012), (85, 0.014), (99, 0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.86045218 25 fast ml-2013-04-10-Gender discrimination


2 0.37702295 32 fast ml-2013-07-05-Processing large files, line by line


3 0.37017035 16 fast ml-2013-01-12-Intro to random forests

Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. as the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfit. Gradient boosted trees (notably the GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectiveness

4 0.36870518 61 fast ml-2014-05-08-Impute missing values with Amelia

Introduction: One of the ways to deal with missing values in data is to impute them. We use the Amelia R package on The Analytics Edge competition data. Since one typically gets many imputed sets, we bag them with good results. So good that it seems we would have won the contest if not for a bug in our code. The competition. Much to our surprise, we ranked 17th out of almost 1700 competitors - from the public leaderboard score we expected to be in top 10%, barely. The contest turned out to be one with huge overfitting possibilities, and people overfitted badly - some preliminary leaders ended up down the middle of the pack, while we soared up the ranks. But wait! There’s more. When preparing this article, we discovered a bug - apparently we used only 1980 points for training:
points_in_test = 1980
train = data.iloc[:points_in_test,]  # should be [:-points_in_test,]
test = data.iloc[-points_in_test:,]
If not for this little bug, we would have won, apparently. Imputing data A

5 0.35574716 20 fast ml-2013-02-18-Predicting advertised salaries


6 0.34964687 17 fast ml-2013-01-14-Feature selection in practice

7 0.34689611 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

8 0.34490696 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

9 0.34342158 27 fast ml-2013-05-01-Deep learning made easy

10 0.34325314 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

11 0.34124497 40 fast ml-2013-10-06-Pylearn2 in practice

12 0.33893454 9 fast ml-2012-10-25-So you want to work for Facebook

13 0.33747435 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

14 0.33616656 13 fast ml-2012-12-27-Spearmint with a random forest

15 0.33237523 18 fast ml-2013-01-17-A very fast denoising autoencoder

16 0.33193752 19 fast ml-2013-02-07-The secret of the big guys

17 0.33163399 43 fast ml-2013-11-02-Maxing out the digits

18 0.32614601 35 fast ml-2013-08-12-Accelerometer Biometric Competition

19 0.32152048 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision

20 0.32118806 14 fast ml-2013-01-04-Madelon: Spearmint's revenge