fast_ml fast_ml-2012 fast_ml-2012-6 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction.

As far as product names go, we will use product data available in XML format to extract SKU and name for each product:

    8564564 Ace Combat 6: Fires of Liberation Platinum Hits
    2755149 Ace Combat: Assault Horizon
    1208344 Adrenalin Misfits

Further, we will process the names in the same way we processed queries:

    8564564 acecombat6firesofliberationplatinumhits
    2755149 acecombatassaulthorizon
    1208344 adrenalinmisfits

When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark.

But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, o
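The lookup order described in the introduction (training queries first, then product names, then the benchmark) can be sketched roughly as follows. This is only an illustration, not the author's script: the function names, the substring matching rule, and the benchmark SKU list are all assumptions made for the example. The normalization is inferred from the sample names shown in the introduction (lowercase, letters and digits only).

```python
import re


def normalize(text):
    """Lowercase and keep only letters and digits (inferred from the examples)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def search_names(query, names):
    """Return SKUs whose normalized name contains the normalized query.

    Substring matching is an assumption; the post does not say how names
    are matched.
    """
    q = normalize(query)
    if not q:
        return []
    return [sku for sku, name in names.items() if q in normalize(name)]


def predict(query, query_to_skus, names, benchmark, n=5):
    """Fill up to n predictions: training mapping, then names, then benchmark."""
    q = normalize(query)
    picks = []

    # 1. SKUs clicked for this query in the training set, most frequent first.
    counts = query_to_skus.get(q, {})
    for sku in sorted(counts, key=counts.get, reverse=True):
        picks.append(sku)

    # 2. Not enough? Search in product names.
    for sku in search_names(q, names):
        if len(picks) >= n:
            break
        if sku not in picks:
            picks.append(sku)

    # 3. Still not enough? Fall back to the benchmark SKUs.
    for sku in benchmark:
        if len(picks) >= n:
            break
        if sku not in picks:
            picks.append(sku)

    return picks[:n]


# Product names taken from the introduction; everything else is made up.
names = {
    "8564564": "Ace Combat 6: Fires of Liberation Platinum Hits",
    "2755149": "Ace Combat: Assault Horizon",
    "1208344": "Adrenalin Misfits",
}
query_to_skus = {"acecombat": {"2755149": 2}}       # hypothetical training mapping
benchmark = ["b1", "b2", "b3", "b4", "b5"]          # placeholder benchmark SKUs

result = predict("Ace Combat", query_to_skus, names, benchmark)
```

Here the one SKU seen in training comes first, the second Ace Combat title is found by name search, and the remaining three places are filled from the placeholder benchmark list.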
sentIndex sentText sentNum sentScore
1 Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. [sent-1, score-0.906]
2 If that is not enough, then we go to the benchmark. [sent-3, score-0.053]
3 We will spell-correct test queries when looking in our query -> sku mapping. [sent-6, score-0.944]
4 Edit distance is one (just adding “e”), so we will catch that easily. [sent-13, score-0.201]
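The edit-distance-one idea can be illustrated with a small corrector in the spirit of Peter Norvig's well-known spelling corrector. This is a sketch under assumptions, not the original script: the function names are hypothetical, the SKU in the example is made up, and picking the candidate with the highest total click count is just one plausible tie-breaking rule.

```python
import string

# Characters that can appear in a normalized query (assumption: a-z and 0-9).
ALPHABET = string.ascii_lowercase + string.digits


def edits1(word):
    """All strings within one edit (delete, transpose, replace, insert) of word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(query, query_to_skus):
    """Map an unseen query to a known query one edit away, if there is one."""
    if query in query_to_skus:
        return query
    candidates = [c for c in edits1(query) if c in query_to_skus]
    if not candidates:
        return query
    # Assumed tie-break: prefer the known query with the most clicks overall.
    return max(candidates, key=lambda c: (sum(query_to_skus[c].values()), c))


# Hypothetical mapping: we know "lanoire" but have never seen "lanoir".
query_to_skus = {"lanoire": {"0000001": 3}}
fixed = correct("lanoir", query_to_skus)
```

With this mapping, `lanoir` resolves to `lanoire` because adding a single "e" is one insertion, while a query with no known neighbour is returned unchanged.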
wordName wordTfidf (topN-words)
[('search', 0.357), ('queries', 0.322), ('sku', 0.29), ('query', 0.284), ('product', 0.273), ('matches', 0.241), ('spelling', 0.213), ('ace', 0.193), ('combat', 0.193), ('corrected', 0.193), ('correction', 0.193), ('skus', 0.193), ('names', 0.162), ('fill', 0.142), ('name', 0.11), ('five', 0.1), ('less', 0.095), ('mapping', 0.085), ('found', 0.083), ('correct', 0.08), ('xml', 0.08), ('edit', 0.08), ('hits', 0.08), ('improving', 0.08), ('processed', 0.08), ('searching', 0.08), ('waiting', 0.08), ('catch', 0.071), ('distance', 0.071), ('places', 0.071), ('norvig', 0.064), ('ten', 0.064), ('benchmark', 0.063), ('adding', 0.059), ('http', 0.059), ('improve', 0.059), ('ideas', 0.055), ('peter', 0.055), ('go', 0.053), ('taking', 0.051), ('follows', 0.051), ('wait', 0.051), ('results', 0.051), ('sum', 0.048), ('looking', 0.048), ('leaderboard', 0.045), ('extract', 0.043), ('final', 0.043), ('instead', 0.041), ('sure', 0.04)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999976 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
2 0.31442678 4 fast ml-2012-09-17-Best Buy mobile contest
Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two, one based on small data and one on big data. Here we will be talking about the small data contest, specifically about beating the benchmark, but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and items which users clicked after the query, plus some other data. Items are in this case Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked, given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351
3 0.2097194 10 fast ml-2012-11-17-The Facebook challenge HOWTO
Introduction: Last time we wrote about the Facebook challenge. Now it’s time for some more details. The main concept is this: in its original state, the data is useless. That’s because there are many names referring to the same entity. Precisely, there are about 350k unique names, and the total number of entities is maybe 20k. So cleaning the data is the first and most important step. Data cleaning: these are the things we do, in order: (1) extract all the names, including numbers, from the graph files and the paths file; (2) compute a fingerprint for each name. We used fingerprints similar to the ones in Google Refine. Ours not only had sorted tokens (words) in a name, but also sorted letters in a token. Names with the same fingerprint are very likely the same. Remaining distortions are mostly in the form of word combinations (a name, a longer name, a still longer name); this can be dealt with by checking if a given name is a subset of any other name. If it is a subset of only one other
4 0.1658964 5 fast ml-2012-09-19-Best Buy mobile contest - big data
Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in the “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants managed to beat the benchmark. But don’t be scared. Most of the data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before
5 0.045751207 19 fast ml-2013-02-07-The secret of the big guys
Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recogni
6 0.041509125 37 fast ml-2013-09-03-Our followers and who else they follow
7 0.041497327 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
8 0.040676199 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
9 0.037523836 43 fast ml-2013-11-02-Maxing out the digits
10 0.037084449 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
11 0.037079439 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
12 0.035358377 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
13 0.033557903 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
14 0.032964986 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
15 0.029984774 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
16 0.029311934 25 fast ml-2013-04-10-Gender discrimination
17 0.028998194 17 fast ml-2013-01-14-Feature selection in practice
18 0.028837735 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics
19 0.028693188 9 fast ml-2012-10-25-So you want to work for Facebook
20 0.026119674 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
topicId topicWeight
[(0, 0.146), (1, -0.103), (2, 0.059), (3, -0.581), (4, 0.058), (5, -0.179), (6, 0.075), (7, -0.171), (8, -0.063), (9, 0.107), (10, -0.088), (11, 0.001), (12, 0.241), (13, -0.084), (14, 0.031), (15, 0.199), (16, 0.023), (17, -0.166), (18, -0.048), (19, 0.011), (20, 0.081), (21, 0.031), (22, -0.029), (23, 0.014), (24, -0.115), (25, -0.007), (26, 0.101), (27, 0.089), (28, 0.025), (29, 0.046), (30, 0.142), (31, 0.099), (32, 0.095), (33, 0.112), (34, 0.004), (35, 0.008), (36, -0.03), (37, -0.031), (38, 0.061), (39, -0.098), (40, -0.007), (41, -0.021), (42, 0.009), (43, 0.01), (44, 0.053), (45, 0.026), (46, -0.049), (47, -0.138), (48, -0.047), (49, -0.157)]
simIndex simValue blogId blogTitle
same-blog 1 0.9880265 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
2 0.6848737 4 fast ml-2012-09-17-Best Buy mobile contest
3 0.43363795 10 fast ml-2012-11-17-The Facebook challenge HOWTO
4 0.22830009 5 fast ml-2012-09-19-Best Buy mobile contest - big data
5 0.11020975 37 fast ml-2013-09-03-Our followers and who else they follow
Introduction: Recently we hit the 400-follower mark on Twitter. To celebrate we decided to do some data mining on you, specifically to discover who our followers are and who else they follow. For your viewing pleasure we packaged the results nicely with Bootstrap. Here’s some data science in action. Our followers: this table shows our 20 most popular followers as measured by their follower count. The occasional question marks stand for non-ASCII characters. Each link opens a new window. Followers Screen name Name Description 8685 pankaj Pankaj Gupta I lead the Personalization and Recommender Systems group at Twitter. Founded two startups in the past. 5070 ogrisel Olivier Grisel Datageek, contributor to scikit-learn, works with Python / Java / Clojure / Pig, interested in Machine Learning, NLProc, {Big|Linked|Open} Data and braaains! 4582 thuske thuske & 4442 ram Ram Ravichandran So
6 0.092386447 33 fast ml-2013-07-09-Introducing phraug
7 0.089024357 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
8 0.08140561 43 fast ml-2013-11-02-Maxing out the digits
9 0.080221519 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
10 0.078444704 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
11 0.077272348 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
12 0.076303452 19 fast ml-2013-02-07-The secret of the big guys
13 0.075207621 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
14 0.071978152 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
15 0.068085119 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics
16 0.065588497 50 fast ml-2014-01-20-How to get predictions from Pylearn2
17 0.065191813 27 fast ml-2013-05-01-Deep learning made easy
18 0.06488508 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
19 0.064471819 61 fast ml-2014-05-08-Impute missing values with Amelia
20 0.064359218 26 fast ml-2013-04-17-Regression as classification
topicId topicWeight
[(35, 0.012), (69, 0.104), (71, 0.01), (99, 0.746)]
simIndex simValue blogId blogTitle
1 0.99733359 5 fast ml-2012-09-19-Best Buy mobile contest - big data
same-blog 2 0.97698534 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
3 0.77553368 4 fast ml-2012-09-17-Best Buy mobile contest
4 0.63954139 16 fast ml-2013-01-12-Intro to random forests
Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. to be the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, which is less prone to overfitting. Gradient boosted trees (notably the GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes
5 0.50810093 41 fast ml-2013-10-09-Big data made easy
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype. Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or
6 0.40418226 9 fast ml-2012-10-25-So you want to work for Facebook
7 0.38499099 25 fast ml-2013-04-10-Gender discrimination
8 0.37799025 2 fast ml-2012-08-27-Kaggle job recommendation challenge
9 0.37387004 61 fast ml-2014-05-08-Impute missing values with Amelia
10 0.34159541 35 fast ml-2013-08-12-Accelerometer Biometric Competition
11 0.33919695 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
12 0.32924119 28 fast ml-2013-05-12-And deliver us from Weka
13 0.32739782 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
14 0.32350624 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
15 0.30042869 19 fast ml-2013-02-07-The secret of the big guys
16 0.29888529 43 fast ml-2013-11-02-Maxing out the digits
17 0.29035074 20 fast ml-2013-02-18-Predicting advertised salaries
18 0.28765854 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
19 0.28293616 26 fast ml-2013-04-17-Regression as classification
20 0.28253636 34 fast ml-2013-07-14-Running things on a GPU