fast_ml fast_ml-2012 fast_ml-2012-6 knowledge-graph by maker-knowledge-mining

6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure


meta info for this blog

Source: html

Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits; 2755149 Ace Combat: Assault Horizon; 1208344 Adrenalin Misfits. Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits; 2755149 acecombatassaulthorizon; 1208344 adrenalinmisfits. When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, o
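The name processing described here boils down to lowercasing and dropping everything that is not a letter or a digit, so that product names end up in the same space as the normalized queries. Below is a minimal Python sketch of that normalization and of the search-in-names fallback, assuming the SKU/name pairs have already been pulled out of the product XML (the variable names are illustrative, not the script's actual ones):

import re

def normalize(text):
    # lowercase and keep only letters and digits, same treatment as the queries
    return re.sub(r'[^a-z0-9]', '', text.lower())

# hypothetical SKU -> raw product name pairs extracted from the XML
products = {
    '8564564': 'Ace Combat 6: Fires of Liberation Platinum Hits',
    '2755149': 'Ace Combat: Assault Horizon',
    '1208344': 'Adrenalin Misfits',
}

# normalized name -> SKU, processed the same way as queries
names = {normalize(name): sku for sku, name in products.items()}

def search_in_names(query, limit=5):
    # return up to `limit` SKUs whose normalized name contains the normalized query
    q = normalize(query)
    return [sku for name, sku in names.items() if q in name][:limit]

print(search_in_names('Ace Combat'))   # finds both Ace Combat SKUs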


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. [sent-1, score-0.906]

2 If that is not enough, then we go to the benchmark. [sent-3, score-0.053]

3 We will spell-correct test queries when looking in our query -> sku mapping. [sent-6, score-0.944]

4 Edit distance is one (just adding “e”), so we will catch that easily. [sent-13, score-0.201]
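The spell correction in sentences 3 and 4 works by trying all strings within edit distance one of a test query against the query -> sku mapping, in the spirit of Peter Norvig's well-known spelling corrector. A rough sketch, assuming the mapping is a plain dictionary of query -> {sku: count} (the names and counts below are hypothetical, not the script's own):

import string

ALPHABET = string.ascii_lowercase + string.digits

def edits1(word):
    # all strings within edit distance one of `word` (Norvig-style)
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def lookup(query, query2sku):
    # exact match first, then any edit-distance-one variant found in the mapping
    if query in query2sku:
        return query2sku[query]
    for candidate in edits1(query):
        if candidate in query2sku:
            return query2sku[candidate]
    return None

query2sku = {'lanoire': {'2670133': 7}}   # hypothetical mapping entry
print(lookup('lanoir', query2sku))        # 'lanoir' + 'e' == 'lanoire', so it hits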


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('search', 0.357), ('queries', 0.322), ('sku', 0.29), ('query', 0.284), ('product', 0.273), ('matches', 0.241), ('spelling', 0.213), ('ace', 0.193), ('combat', 0.193), ('corrected', 0.193), ('correction', 0.193), ('skus', 0.193), ('names', 0.162), ('fill', 0.142), ('name', 0.11), ('five', 0.1), ('less', 0.095), ('mapping', 0.085), ('found', 0.083), ('correct', 0.08), ('xml', 0.08), ('edit', 0.08), ('hits', 0.08), ('improving', 0.08), ('processed', 0.08), ('searching', 0.08), ('waiting', 0.08), ('catch', 0.071), ('distance', 0.071), ('places', 0.071), ('norvig', 0.064), ('ten', 0.064), ('benchmark', 0.063), ('adding', 0.059), ('http', 0.059), ('improve', 0.059), ('ideas', 0.055), ('peter', 0.055), ('go', 0.053), ('taking', 0.051), ('follows', 0.051), ('wait', 0.051), ('results', 0.051), ('sum', 0.048), ('looking', 0.048), ('leaderboard', 0.045), ('extract', 0.043), ('final', 0.043), ('instead', 0.041), ('sure', 0.04)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure

Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits; 2755149 Ace Combat: Assault Horizon; 1208344 Adrenalin Misfits. Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits; 2755149 acecombatassaulthorizon; 1208344 adrenalinmisfits. When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, o

2 0.31442678 4 fast ml-2012-09-17-Best Buy mobile contest

Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two: one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and the items which users clicked after the query, plus some other data. Items in this case are Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351
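The dictionary shown above is just click counts per query; building it and turning it into MAP@5 predictions takes only a few lines. A sketch under the assumption that the training set has already been reduced to (normalized query, clicked SKU) pairs (some SKUs in the example are made up):

from collections import defaultdict

def build_mapping(pairs):
    # count how many times each SKU was clicked for each normalized query
    mapping = defaultdict(lambda: defaultdict(int))
    for query, sku in pairs:
        mapping[query][sku] += 1
    return mapping

def top_skus(mapping, query, n=5):
    # up to n SKUs for a query, most frequently clicked first (what MAP@5 rewards)
    counts = mapping.get(query, {})
    return sorted(counts, key=counts.get, reverse=True)[:n]

# tiny illustrative training set; the second finalfantasy13 SKU is hypothetical
pairs = [('forzasteeringwheel', '2078113'),
         ('finalfantasy13', '9461183'), ('finalfantasy13', '9461183'),
         ('finalfantasy13', '1234567')]
mapping = build_mapping(pairs)
print(top_skus(mapping, 'finalfantasy13'))   # ['9461183', '1234567']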

3 0.2097194 10 fast ml-2012-11-17-The Facebook challenge HOWTO

Introduction: Last time we wrote about the Facebook challenge. Now it’s time for some more details. The main concept is this: in its original state, the data is useless. That’s because there are many names referring to the same entity. Precisely, there are about 350k unique names, and the total number of entities is maybe 20k. So cleaning the data is the first and most important step. Data cleaning These are the things we do, in order: extract all the names, including numbers, from the graph files and the paths file; compute a fingerprint for each name. We used fingerprints similar to the ones in Google Refine. Ours not only had sorted tokens (words) in a name, but also sorted letters within a token. Names with the same fingerprint are very likely the same. Remaining distortions are mostly in the form of word combinations (“a name”, “a longer name”, “a still longer name”); this can be dealt with by checking if a given name is a subset of any other name. If it is a subset of only one other
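The fingerprint described in that excerpt (tokens sorted, and letters sorted within each token) is easy to approximate; the sketch below is a rough stand-in, not the challenge code itself:

import re

def fingerprint(name):
    # sort the letters inside each token, then sort the tokens themselves
    tokens = re.findall(r'[a-z0-9]+', name.lower())
    return ' '.join(sorted(''.join(sorted(t)) for t in tokens))

# names differing in word order, case, or small letter scrambles collide
print(fingerprint('Warsaw University') == fingerprint('university warsaw'))   # True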

4 0.1658964 5 fast ml-2012-09-19-Best Buy mobile contest - big data

Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark. But don’t be scared. Most of the data mass is in the XML product information. The training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before
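The "one difference" mentioned at the end is the fallback: where the small-data script padded missing predictions with the same five hard-coded IDs, here each test line gets padded with that line's own benchmark recommendations. A hypothetical sketch of that padding logic (file parsing omitted):

def fill_line(query, mapping, benchmark_skus, n=5):
    # predictions from the query -> sku mapping, padded per line from the benchmark
    counts = mapping.get(query, {})
    preds = sorted(counts, key=counts.get, reverse=True)[:n]
    for sku in benchmark_skus:          # this line's benchmark recommendations
        if len(preds) == n:
            break
        if sku not in preds:
            preds.append(sku)
    return preds

# usage: one call per test line, zipping test queries with benchmark lines
# predictions = [fill_line(q, mapping, bench) for q, bench in zip(queries, benchmark)]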

5 0.045751207 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recogni
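The combination that excerpt hints at, representing each example by its distances to learned cluster centers and then training a linear model on that representation, can be sketched with scikit-learn as a stand-in for Sofia-ML (synthetic data, illustrative parameters):

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learn cluster centers, then map each point to its distances from them
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_train)
Z_train = kmeans.transform(X_train)    # the new feature space
Z_test = kmeans.transform(X_test)

# a plain linear model on top of the K-means features
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print(clf.score(Z_test, y_test))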

6 0.041509125 37 fast ml-2013-09-03-Our followers and who else they follow

7 0.041497327 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

8 0.040676199 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

9 0.037523836 43 fast ml-2013-11-02-Maxing out the digits

10 0.037084449 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

11 0.037079439 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision

12 0.035358377 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

13 0.033557903 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview

14 0.032964986 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

15 0.029984774 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

16 0.029311934 25 fast ml-2013-04-10-Gender discrimination

17 0.028998194 17 fast ml-2013-01-14-Feature selection in practice

18 0.028837735 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics

19 0.028693188 9 fast ml-2012-10-25-So you want to work for Facebook

20 0.026119674 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.146), (1, -0.103), (2, 0.059), (3, -0.581), (4, 0.058), (5, -0.179), (6, 0.075), (7, -0.171), (8, -0.063), (9, 0.107), (10, -0.088), (11, 0.001), (12, 0.241), (13, -0.084), (14, 0.031), (15, 0.199), (16, 0.023), (17, -0.166), (18, -0.048), (19, 0.011), (20, 0.081), (21, 0.031), (22, -0.029), (23, 0.014), (24, -0.115), (25, -0.007), (26, 0.101), (27, 0.089), (28, 0.025), (29, 0.046), (30, 0.142), (31, 0.099), (32, 0.095), (33, 0.112), (34, 0.004), (35, 0.008), (36, -0.03), (37, -0.031), (38, 0.061), (39, -0.098), (40, -0.007), (41, -0.021), (42, 0.009), (43, 0.01), (44, 0.053), (45, 0.026), (46, -0.049), (47, -0.138), (48, -0.047), (49, -0.157)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9880265 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure

Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits; 2755149 Ace Combat: Assault Horizon; 1208344 Adrenalin Misfits. Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits; 2755149 acecombatassaulthorizon; 1208344 adrenalinmisfits. When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, o

2 0.6848737 4 fast ml-2012-09-17-Best Buy mobile contest

Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two: one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and the items which users clicked after the query, plus some other data. Items in this case are Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351

3 0.43363795 10 fast ml-2012-11-17-The Facebook challenge HOWTO

Introduction: Last time we wrote about the Facebook challenge. Now it’s time for some more details. The main concept is this: in its original state, the data is useless. That’s because there are many names referring to the same entity. Precisely, there are about 350k unique names, and the total number of entities is maybe 20k. So cleaning the data is the first and most important step. Data cleaning These are the things we do, in order: extract all the names, including numbers, from the graph files and the paths file; compute a fingerprint for each name. We used fingerprints similar to the ones in Google Refine. Ours not only had sorted tokens (words) in a name, but also sorted letters within a token. Names with the same fingerprint are very likely the same. Remaining distortions are mostly in the form of word combinations (“a name”, “a longer name”, “a still longer name”); this can be dealt with by checking if a given name is a subset of any other name. If it is a subset of only one other

4 0.22830009 5 fast ml-2012-09-19-Best Buy mobile contest - big data

Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark. But don’t be scared. Most of the data mass is in the XML product information. The training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before

5 0.11020975 37 fast ml-2013-09-03-Our followers and who else they follow

Introduction: Recently we hit the 400-follower mark on Twitter. To celebrate we decided to do some data mining on you, specifically to discover who our followers are and who else they follow. For your viewing pleasure we packaged the results nicely with Bootstrap. Here’s some data science in action. Our followers This table shows our 20 most popular followers as measured by their follower count. The occasional question marks stand for non-ASCII characters. Each link opens a new window. Followers Screen name Name Description 8685 pankaj Pankaj Gupta I lead the Personalization and Recommender Systems group at Twitter. Founded two startups in the past. 5070 ogrisel Olivier Grisel Datageek, contributor to scikit-learn, works with Python / Java / Clojure / Pig, interested in Machine Learning, NLProc, {Big|Linked|Open} Data and braaains! 4582 thuske thuske & 4442 ram Ram Ravichandran So

6 0.092386447 33 fast ml-2013-07-09-Introducing phraug

7 0.089024357 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

8 0.08140561 43 fast ml-2013-11-02-Maxing out the digits

9 0.080221519 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

10 0.078444704 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

11 0.077272348 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

12 0.076303452 19 fast ml-2013-02-07-The secret of the big guys

13 0.075207621 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview

14 0.071978152 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

15 0.068085119 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics

16 0.065588497 50 fast ml-2014-01-20-How to get predictions from Pylearn2

17 0.065191813 27 fast ml-2013-05-01-Deep learning made easy

18 0.06488508 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

19 0.064471819 61 fast ml-2014-05-08-Impute missing values with Amelia

20 0.064359218 26 fast ml-2013-04-17-Regression as classification


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(35, 0.012), (69, 0.104), (71, 0.01), (99, 0.746)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.99733359 5 fast ml-2012-09-19-Best Buy mobile contest - big data

Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark. But don’t be scared. Most of the data mass is in the XML product information. The training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before

same-blog 2 0.97698534 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure

Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits; 2755149 Ace Combat: Assault Horizon; 1208344 Adrenalin Misfits. Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits; 2755149 acecombatassaulthorizon; 1208344 adrenalinmisfits. When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, o

3 0.77553368 4 fast ml-2012-09-17-Best Buy mobile contest

Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two: one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and the items which users clicked after the query, plus some other data. Items in this case are Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351

4 0.63954139 16 fast ml-2013-01-12-Intro to random forests

Introduction: Let’s step back from forays into cutting-edge topics and look at the random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. to be the best overall approach for a variety of problems. Random forests, specifically, perform well in both low-dimensional and high-dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). A random forest is an example of the bagging approach, less prone to overfitting. Gradient boosted trees (notably the GBM package in R) represent the other. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes
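The bagging/boosting split described there maps directly onto standard estimators; the scikit-learn snippet below is only an illustration of the two families, not code from that post (which refers to R packages):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# bagged trees: each tree is grown independently on a bootstrap sample
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# boosted trees: each tree tries to correct the errors of the previous ones
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in [('random forest', rf), ('gradient boosting', gbm)]:
    print(name, cross_val_score(model, X, y, cv=3).mean())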

5 0.50810093 41 fast ml-2013-10-09-Big data made easy

Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for data science, machine learning, data mining, predictive analytics, and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or

6 0.40418226 9 fast ml-2012-10-25-So you want to work for Facebook

7 0.38499099 25 fast ml-2013-04-10-Gender discrimination

8 0.37799025 2 fast ml-2012-08-27-Kaggle job recommendation challenge

9 0.37387004 61 fast ml-2014-05-08-Impute missing values with Amelia

10 0.34159541 35 fast ml-2013-08-12-Accelerometer Biometric Competition

11 0.33919695 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

12 0.32924119 28 fast ml-2013-05-12-And deliver us from Weka

13 0.32739782 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

14 0.32350624 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers

15 0.30042869 19 fast ml-2013-02-07-The secret of the big guys

16 0.29888529 43 fast ml-2013-11-02-Maxing out the digits

17 0.29035074 20 fast ml-2013-02-18-Predicting advertised salaries

18 0.28765854 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

19 0.28293616 26 fast ml-2013-04-17-Regression as classification

20 0.28253636 34 fast ml-2013-07-14-Running things on a GPU