fast_ml fast_ml-2014 fast_ml-2014-59 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: This time we attempt to predict if poll responders said they’re happy or not. We also take a look at what features are the most important for that prediction. There is a private competition at Kaggle for students of The Analytics Edge MOOC. You can get the invitation link by signing up for the course and going to the week seven front page. It’s an entry-level, educational contest - there are no prizes and the data is small. The competition is based on data from the Show of Hands, a mobile polling application for US residents. The link between the app and the MOOC is MIT: it’s an MIT class and an MIT alumnus’s app. You get a few thousand examples. Each consists of some demographics: year of birth, gender, income, household status, education level, party preference. Plus a number of answers to yes/no poll questions from the Show of Hands. Here’s a sample: Are you good at math? Have you cried in the past 60 days? Do you brush your teeth two or more times ever
sentIndex sentText sentNum sentScore
1 This time we attempt to predict if poll responders said they’re happy or not. [sent-1, score-0.516]
2 There is a private competition at Kaggle for students of The Analytics Edge MOOC. [sent-3, score-0.071]
3 You can get the invitation link by signing up for the course and going to the week seven front page. [sent-4, score-0.184]
4 It’s an entry-level, educational contest - there are no prizes and the data is small. [sent-5, score-0.205]
5 The competition is based on data from the Show of Hands, a mobile polling application for US residents. [sent-6, score-0.196]
6 The link between the app and the MOOC is MIT: it’s an MIT class and an MIT alumnus’s app. [sent-7, score-0.175]
7 Each consists of some demographics: year of birth, gender, income, household status, education level, party preference. Plus a number of answers to yes/no poll questions from the Show of Hands. [sent-9, score-1.001]
8 Do you drink the unfiltered tap water in your home? [sent-14, score-0.062]
9 The goal is to predict how a person responded to “are you happy?” [sent-17, score-0.241]
10 It should be interesting to see what factors have the most impact on the target variable. [sent-20, score-0.166]
11 The contest is evaluated using AUC and the top of the leaderboard scores about 0. [sent-21, score-0.071]
12 The data has a lot of missing values, because few responders answered all 101 questions. [sent-23, score-0.351]
13 There’s a variable with a vote count for each person. Among the demographics, only YOB has missing values. [sent-24, score-0.339]
14 data$YOB[is.na(data$YOB)] = 0 (see the reconstructed R sketch after this list). It is convenient to use R, because R can handle categorical values natively, without vectorizing to one-hot encoding. [sent-27, score-0.071]
15 p_train = 0.8 # proportion of training examples; n = nrow( data ); train_len = round( n * p_train ); test_start = train_len + 1; i = sample. [sent-29, score-0.063]
16 Now to see which factors are important for predicting happiness. [sent-38, score-0.104]
17 var = 18, main = "Importance of variables" ) It seems that what matters is mostly demographics, with the exception of gender (a hedged R sketch of the importance plot follows this list). [sent-41, score-0.142]
18 Apparently there’s a gap: the three questions matter much more than the others. [sent-44, score-0.096]
19 They are: 118237: Do you feel like you are “in over-your-head” in any aspect of your life right now? [sent-45, score-0.267]
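The R fragments in items 14 and 15 above were mangled by the sentence extractor. Below is a minimal reconstructed sketch, in plain R, of what they appear to do: zero out missing YOB values and split the data into train and test parts at random. The file name, the read.csv call and the exact 0.8 split value are assumptions inferred from the text, not code quoted verbatim from the original post.

    # reconstructed sketch, not the author's exact code
    data = read.csv( "train.csv", stringsAsFactors = TRUE )

    # among the demographics only year of birth (YOB) has missing values
    data$YOB[is.na( data$YOB )] = 0

    p_train = 0.8                       # proportion of training examples
    n = nrow( data )
    train_len = round( n * p_train )
    test_start = train_len + 1

    i = sample( n )                     # random permutation of row indices
    train = data[i[1:train_len], ]
    test = data[i[test_start:n], ]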
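The importance plot in item 17 is only hinted at by the truncated "var = 18, main = "Importance of variables"" fragment, which looks like the n.var argument of randomForest's varImpPlot. The sketch below is a hedged guess along those lines, not the author's confirmed pipeline: the post may have used a different model (the word list above also mentions naive bayes), and the target column name Happy is an assumption. AUC is computed by hand from ranks, so nothing beyond the randomForest package is needed.

    library( randomForest )

    # assumption: the target column is called Happy and holds 0/1 answers
    train$Happy = as.factor( train$Happy )
    test$Happy = as.factor( test$Happy )

    model = randomForest( Happy ~ ., data = train )

    # importance plot; n.var = 18 matches the fragment quoted in item 17
    varImpPlot( model, n.var = 18, main = "Importance of variables" )

    # predicted probability of the positive class on the held-out part
    p = predict( model, test, type = "prob" )[, 2]

    # AUC via the rank (Wilcoxon) formula
    auc = function( labels, scores ) {
        pos = scores[labels == 1]
        neg = scores[labels == 0]
        r = rank( c( pos, neg ) )
        ( sum( r[seq_along( pos )] ) - length( pos ) * ( length( pos ) + 1 ) / 2 ) /
            ( length( pos ) * length( neg ) )
    }
    auc( as.numeric( as.character( test$Happy ) ), p )

If the observations in items 17 and 18 hold, the sorted plot should show the demographics (except gender) and a handful of questions well ahead of the rest.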
wordName wordTfidf (topN-words)
[('yob', 0.596), ('demographics', 0.255), ('mit', 0.255), ('income', 0.17), ('person', 0.17), ('poll', 0.17), ('matters', 0.142), ('education', 0.142), ('responders', 0.142), ('happy', 0.142), ('feel', 0.125), ('link', 0.113), ('importance', 0.113), ('factors', 0.104), ('level', 0.104), ('questions', 0.096), ('missing', 0.084), ('contest', 0.071), ('prizes', 0.071), ('life', 0.071), ('home', 0.071), ('mooc', 0.071), ('bayes', 0.071), ('naive', 0.071), ('aspect', 0.071), ('private', 0.071), ('round', 0.071), ('answers', 0.071), ('front', 0.071), ('child', 0.071), ('inherently', 0.071), ('mac', 0.071), ('responded', 0.071), ('application', 0.071), ('vectorizing', 0.071), ('auc', 0.068), ('show', 0.064), ('data', 0.063), ('status', 0.062), ('year', 0.062), ('impact', 0.062), ('math', 0.062), ('gender', 0.062), ('said', 0.062), ('app', 0.062), ('mobile', 0.062), ('study', 0.062), ('unfiltered', 0.062), ('answered', 0.062), ('birth', 0.062)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
2 0.067736447 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
Introduction: Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is from a value like “BMW” or “Mercedes” to a vector of zeros and a single 1. This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers. UPDATE: Turns out that Pandas has a get_dummies() function which does what we’re after. More on this in a while. We’ll use Pandas to load the data, do some cleaning and send it to Scikit-
3 0.066710711 20 fast ml-2013-02-18-Predicting advertised salaries
Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use them all. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con
4 0.063878193 22 fast ml-2013-03-07-Choosing a machine learning algorithm
Introduction: To celebrate the first 100 followers on Twitter, we asked them what they would like to read about here. One of the responders, Itamar Berger, suggested a topic: how to choose an ML algorithm for a task at hand. Well, what do we know? Three things come to mind: We’d try fast things first. In terms of speed, here’s how we imagine the order: linear models; trees, that is bagged or boosted trees; everything else*. We’d use something we are comfortable with. Learning new things is very exciting, however we’d ask ourselves a question: do we want to learn a new technique, or do we want a result? We’d prefer something with fewer hyperparameters to set. More params means more tuning, that is training and re-training over and over, even if automatically. Random forests are hard to beat in this department. Linear models are pretty good too. A random forest scene, credit: Jonathan MacGregor *By “everything else” we mean “everything popular”, mostly things like
5 0.056177124 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
Introduction: Geoff Hinton is a living legend. He almost single-handedly invented backpropagation for training feed-forward neural networks. Despite in theory being universal function approximators, these networks turned out to be pretty much useless for more complex problems, like computer vision and speech recognition. Professor Hinton responded by creating deep networks and deep learning, an ultimate form of machine learning. Recently we’ve been fortunate to ask Geoff a few questions and have him answer them. Geoff, thanks so much for talking to us. You’ve had a long and fruitful career. What drives you these days? Well, after a man hits a certain age, his priorities change. Back in the 80s I was happy when I was able to train a network with eight hidden units. Now I can finally have thousands and possibly millions of them. So I guess the answer is scale. Apart from that, I like people at Google and I like making them a ton of money. They happen to pay me well, so it’s a win-win situ
6 0.05311304 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
7 0.053031158 25 fast ml-2013-04-10-Gender discrimination
8 0.05198383 39 fast ml-2013-09-19-What you wanted to know about AUC
9 0.049521089 61 fast ml-2014-05-08-Impute missing values with Amelia
10 0.046987928 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
11 0.046892606 16 fast ml-2013-01-12-Intro to random forests
12 0.04442066 26 fast ml-2013-04-17-Regression as classification
13 0.044322941 15 fast ml-2013-01-07-Machine learning courses online
14 0.042807452 13 fast ml-2012-12-27-Spearmint with a random forest
15 0.042773083 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
16 0.042246521 50 fast ml-2014-01-20-How to get predictions from Pylearn2
17 0.040002149 37 fast ml-2013-09-03-Our followers and who else they follow
18 0.038819954 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
19 0.038094558 19 fast ml-2013-02-07-The secret of the big guys
20 0.036525354 27 fast ml-2013-05-01-Deep learning made easy
topicId topicWeight
[(0, 0.161), (1, 0.001), (2, -0.01), (3, -0.029), (4, 0.019), (5, 0.072), (6, -0.048), (7, 0.025), (8, 0.109), (9, 0.144), (10, -0.16), (11, -0.173), (12, -0.32), (13, 0.018), (14, -0.206), (15, 0.111), (16, -0.243), (17, -0.033), (18, 0.055), (19, -0.213), (20, 0.103), (21, 0.208), (22, 0.154), (23, 0.042), (24, 0.054), (25, 0.156), (26, -0.005), (27, -0.43), (28, -0.036), (29, -0.102), (30, 0.259), (31, 0.21), (32, 0.035), (33, -0.076), (34, 0.032), (35, 0.134), (36, -0.058), (37, -0.073), (38, -0.179), (39, -0.277), (40, 0.144), (41, 0.098), (42, 0.056), (43, 0.058), (44, -0.026), (45, -0.014), (46, -0.044), (47, 0.075), (48, -0.005), (49, -0.009)]
simIndex simValue blogId blogTitle
same-blog 1 0.96859556 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
2 0.13069297 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
3 0.12939933 20 fast ml-2013-02-18-Predicting advertised salaries
4 0.12180355 61 fast ml-2014-05-08-Impute missing values with Amelia
Introduction: One of the ways to deal with missing values in data is to impute them. We use the Amelia R package on The Analytics Edge competition data. Since one typically gets many imputed sets, we bag them with good results. So good that it seems we would have won the contest if not for a bug in our code. The competition Much to our surprise, we ranked 17th out of almost 1700 competitors - from the public leaderboard score we expected to be in the top 10%, barely. The contest turned out to be one with huge overfitting possibilities, and people overfitted badly - some preliminary leaders ended up down the middle of the pack, while we soared up the ranks. But wait! There’s more. When preparing this article, we discovered a bug - apparently we used only 1980 points for training: points_in_test = 1980 train = data.iloc[:points_in_test,] # should be [:-points_in_test,] test = data.iloc[-points_in_test:,] If not for this little bug, we would have won, apparently. Imputing data A
5 0.11958636 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
Introduction: Kaggle again. This time, solar energy prediction. We will show how to get data out of NetCDF4 files in Python and then beat the benchmark. The goal of this competition is to predict solar energy at Oklahoma Mesonet stations (red dots) from weather forecasts for GEFS points (blue dots): We’re getting a number of NetCDF files, each holding info about one variable, like expected temperature or precipitation. These variables have a few dimensions: time: the training set contains information about 5113 consecutive days. For each day there are five forecasts for different hours. location: location is described by latitude and longitude of GEFS points weather models: there are 11 forecasting models, called ensembles NetCDF4 tutorial The data is in NetCDF files, binary format apparently popular for storing scientific data. We will access it from Python using netcdf4-python . To use it, you will need to install HDF5 and NetCDF4 libraries first. If you’re on Wind
6 0.11738061 25 fast ml-2013-04-10-Gender discrimination
7 0.11119419 19 fast ml-2013-02-07-The secret of the big guys
8 0.10469081 13 fast ml-2012-12-27-Spearmint with a random forest
9 0.10397083 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
10 0.099189639 27 fast ml-2013-05-01-Deep learning made easy
11 0.097140335 41 fast ml-2013-10-09-Big data made easy
12 0.092935458 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
13 0.092767172 37 fast ml-2013-09-03-Our followers and who else they follow
14 0.091061227 39 fast ml-2013-09-19-What you wanted to know about AUC
15 0.086749516 40 fast ml-2013-10-06-Pylearn2 in practice
16 0.083960935 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
17 0.083868511 18 fast ml-2013-01-17-A very fast denoising autoencoder
18 0.082679503 8 fast ml-2012-10-15-Merck challenge
19 0.082175709 22 fast ml-2013-03-07-Choosing a machine learning algorithm
20 0.081915461 15 fast ml-2013-01-07-Machine learning courses online
topicId topicWeight
[(26, 0.045), (31, 0.017), (35, 0.021), (36, 0.554), (55, 0.027), (58, 0.023), (69, 0.126), (71, 0.037), (81, 0.014), (99, 0.033)]
simIndex simValue blogId blogTitle
same-blog 1 0.87707311 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
2 0.23501182 27 fast ml-2013-05-01-Deep learning made easy
Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple of benchmarks for this competition and the best one is unusually hard to beat 1 - fewer than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github, and has one or two very appealing properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2. Geoff Hinton from Toronto talks about two ends
3 0.23490129 13 fast ml-2012-12-27-Spearmint with a random forest
Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees ( ntrees ) and a number of candidate features at each split ( mtry ). Here’s some code . We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in (0,10) range. We will split the data 80/10/10 into train, validation and test set, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there’s eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green : RMSE < 0.57 blue : RMSE < 0.58 black : RMSE >= 0.58 Turns out that some diffe
4 0.22693378 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better than default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely previously mentioned Madelon from NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of a train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate final results of parameter tuning. As an error measure we use Area Under Curve , or AUC, which was
5 0.22668985 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
Introduction: Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. They will evaluate the results using the mean average precision, or MAP, metric. Specifically MAP@x - this means they ask us to recommend x items for each user (a short worked AP@x sketch in R follows this listing). So what is this MAP? First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean for Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision? It may be that we don’t really need to know. But we probably need to know this: we can recommend at most x items for each user; it pays to submit all x recommendations, because we are not penalized for bad guesses; order matters, so it’s better to submit more certain recommendations fi
6 0.22397964 18 fast ml-2013-01-17-A very fast denoising autoencoder
7 0.22204611 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
8 0.2216862 17 fast ml-2013-01-14-Feature selection in practice
9 0.22110589 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
10 0.22023039 9 fast ml-2012-10-25-So you want to work for Facebook
11 0.21938699 20 fast ml-2013-02-18-Predicting advertised salaries
12 0.21935959 43 fast ml-2013-11-02-Maxing out the digits
13 0.21609156 16 fast ml-2013-01-12-Intro to random forests
14 0.21385147 40 fast ml-2013-10-06-Pylearn2 in practice
15 0.21031474 19 fast ml-2013-02-07-The secret of the big guys
16 0.209757 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
17 0.20956275 35 fast ml-2013-08-12-Accelerometer Biometric Competition
18 0.20644885 25 fast ml-2013-04-10-Gender discrimination
19 0.20163199 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
20 0.1991033 32 fast ml-2013-07-05-Processing large files, line by line
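Item 5 in the listing above describes mean average precision only in prose; a tiny worked sketch makes the definition concrete. This is a generic illustration written in R (the language of the main post), not code from that article, and the example users and items are made up.

    # average precision at x for one user: mean of precision@k over the
    # positions k at which a relevant item was recommended
    apk = function( actual, predicted, x ) {
        predicted = head( predicted, x )
        hits = predicted %in% actual
        if ( sum( hits ) == 0 ) return( 0 )
        precision_at_k = cumsum( hits ) / seq_along( hits )
        sum( precision_at_k[hits] ) / min( length( actual ), x )
    }

    # MAP@x is just the mean of AP@x over all users
    apk( actual = c( "a", "b" ), predicted = c( "a", "c", "b" ), x = 3 )   # 0.8333
    mapk = mean( c( apk( c( "a", "b" ), c( "a", "c", "b" ), 3 ),
                    apk( c( "d" ), c( "e", "f", "d" ), 3 ) ) )             # 0.5833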