fast_ml fast_ml-2012 fast_ml-2012-4 knowledge-graph by maker-knowledge-mining

4 fast ml-2012-09-17-Best Buy mobile contest


meta info for this blog

Source: html

Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two, one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and the items which users clicked after each query, plus some other data. In this case the items are Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked, given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re going to build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '3519923': 2}, 'guitarps3': {'2633103': 1}
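To make this concrete, here is a minimal sketch of the mapping-building step, in the spirit of the script described above (not the author's actual code). It assumes the training set is a CSV file named train.csv with 'query' and 'sku' columns; the file name and column names are illustrative assumptions.

import csv
import re
from collections import defaultdict

def normalize(query):
    # 'Rock Smith_1' -> 'rocksmith1': lowercase, drop non-alphanumerics
    return re.sub(r'[^a-z0-9]', '', query.lower())

def build_mapping(train_path='train.csv'):  # hypothetical file name
    # Map each normalized query to {item SKU: click count}, e.g.
    # 'finalfantasy13': {'9461183': 3, '3519923': 2}
    mapping = defaultdict(lambda: defaultdict(int))
    with open(train_path) as f:
        for row in csv.DictReader(f):
            mapping[normalize(row['query'])][row['sku']] += 1
    return mapping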


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 There’s a contest on Kaggle called ACM Hackathon. [sent-1, score-0.088]

2 Actually, there are two, one based on small data and one on big data. [sent-2, score-0.08]

3 Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. [sent-3, score-0.669]

4 The deal is, we have a training set from Best Buy with search queries and items which users clicked after the query, plus some other data. [sent-4, score-1.134]

5 We are asked to predict the item a user clicked, given the query. [sent-6, score-0.606]

6 The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. [sent-8, score-0.246]

7 Concretely ;), we’re going to build a mapping from queries to items, using the training set. [sent-10, score-0.499]

8 It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '3519923': 2}, 'guitarps3': {'2633103': 1} Keys in the dictionary are queries; values map game IDs to click counts. [sent-11, score-1.316]

9 One thing worth noting here is that we “prepare” the queries, meaning that “Rock Smith_1” becomes “rocksmith1” - the process is to filter out all non-alphanumeric characters and convert to lowercase. [sent-12, score-0.154]

10 This way, queries like “Rocksmith 1” and “Rock Smith_1” are the same. [sent-13, score-0.439]

11 This makes sense, because the intent behind them is the same; only the spelling is different. [sent-14, score-0.165]

12 When we are asked to predict an item for a given query, we will check if the query is in our dictionary. [sent-15, score-0.705]

13 If it is, we will recommend up to five IDs found there, starting from the one with the most clicks. [sent-16, score-0.292]

14 When there are fewer than five IDs, we take the rest from the benchmark. [sent-17, score-0.145]

15 This is easy because the benchmark always recommends the five most popular items: [9854804, 2107458, 2541184, 2670133, 2173065] Similarly, if the query is not in our dictionary, we will take all five recommendations from the benchmark (see the code sketch after this list). [sent-18, score-0.891]

16 The script takes a few seconds to run and produces a score of 72.8%, while the benchmark scores 14.5%. [sent-19, score-0.209]
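Putting the prediction rule from the sentences above together, here is a minimal sketch (reusing normalize and the mapping from the earlier sketch; the benchmark SKUs are the five popular items quoted in sentence 15):

# The benchmark's five most popular items, as quoted above
BENCHMARK = ['9854804', '2107458', '2541184', '2670133', '2173065']

def predict(query, mapping, k=5):
    # Clicked items recorded for this query, most clicks first
    counts = mapping.get(normalize(query), {})
    recs = sorted(counts, key=counts.get, reverse=True)[:k]
    # Fill any remaining slots from the benchmark
    for sku in BENCHMARK:
        if len(recs) == k:
            break
        if sku not in recs:
            recs.append(sku)
    return recs

For example, predict('Rock Smith_1', mapping) returns the most-clicked SKUs recorded for 'rocksmith1', padded with benchmark items if fewer than five were seen in training.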


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('queries', 0.439), ('items', 0.35), ('query', 0.31), ('clicked', 0.211), ('rock', 0.211), ('rocksmith', 0.211), ('ids', 0.193), ('dictionary', 0.193), ('five', 0.145), ('item', 0.14), ('benchmark', 0.139), ('map', 0.129), ('asked', 0.119), ('contest', 0.088), ('applicable', 0.088), ('behind', 0.088), ('keys', 0.088), ('na', 0.088), ('recommends', 0.088), ('require', 0.088), ('traditional', 0.088), ('xbox', 0.088), ('small', 0.08), ('beating', 0.077), ('click', 0.077), ('concretely', 0.077), ('equally', 0.077), ('filter', 0.077), ('noting', 0.077), ('similarly', 0.077), ('spelling', 0.077), ('starting', 0.077), ('predict', 0.076), ('recommend', 0.07), ('typical', 0.07), ('plus', 0.07), ('buy', 0.07), ('seconds', 0.07), ('recommendations', 0.064), ('users', 0.064), ('call', 0.064), ('game', 0.064), ('prepare', 0.064), ('writing', 0.064), ('given', 0.06), ('ideas', 0.06), ('build', 0.06), ('explanation', 0.06), ('sense', 0.06), ('talking', 0.06)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 4 fast ml-2012-09-17-Best Buy mobile contest


2 0.31442678 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure

Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits 2755149 Ace Combat: Assault Horizon 1208344 Adrenalin Misfits Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits 2755149 acecombatassaulthorizon 1208344 adrenalinmisfits When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire ( lanoire ), but not for L.A. Noir ( lanoir ). It is easy to see that this is basically the same query, o

3 0.24673815 5 fast ml-2012-09-19-Best Buy mobile contest - big data

Introduction: Last time we talked about the small data branch of Best Buy contest . Now it’s time to tackle the big boy . It is positioned as “cloud computing sized problem”, because there is 7GB of unpacked data, vs. younger brother’s 20MB. This is reflected in “cloud computing” and “cluster” and “Oracle” talk in the forum , and also in small number of participating teams: so far, only six contestants managed to beat the benchmark. But don’t be scared. Most of data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (benchmark is 0.304). Not impressed? With simple extension you can up the score to 0.55. Read below for details. This is the very same script , with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before

4 0.1885727 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision

Introduction: Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. They will evaluate the results using mean average precision, or MAP, metric. Specifically MAP@x - this means they ask us to recommend x items for each user. So what is this MAP? First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean for Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision? It may be that we don’t really need to know. But we probably need to know this: we can recommend at most x items for each user it pays to submit all x recommendations, because we are not penalized for bad guesses order matters, so it’s better to submit more certain recommendations fi

5 0.11024731 2 fast ml-2012-08-27-Kaggle job recommendation challenge

Introduction: This is an introduction to Kaggle job recommendation challenge . It looks a lot like a typical collaborative filtering thing (with a lot of extra information), but not quite. Spot these two big differences: There are no explicit ratings. Instead, there’s info about which jobs user applied to. This is known as one-class collaborative filtering (OCCF), or learning from positive-only feedback. If you want to dig deeper into the subject, there have been already contests with positive feedback only, for example track two of Yahoo KDD Cup or Millions Songs Dataset Challenge at Kaggle (both about songs). The second difference is less apparent. When you look at test users (that is, the users that we are asked to recommend jobs for), only about half of them made at least one application. For the other half, no data and no collaborative filtering. For the users we have applications data for, it’s very sparse, so we would like to use CF, because it does well in similar se

6 0.05705696 19 fast ml-2013-02-07-The secret of the big guys

7 0.055348571 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

8 0.054075394 25 fast ml-2013-04-10-Gender discrimination

9 0.053018004 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

10 0.050431527 10 fast ml-2012-11-17-The Facebook challenge HOWTO

11 0.042774387 20 fast ml-2013-02-18-Predicting advertised salaries

12 0.040711269 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

13 0.040413532 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

14 0.03707229 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

15 0.035973765 41 fast ml-2013-10-09-Big data made easy

16 0.034236811 9 fast ml-2012-10-25-So you want to work for Facebook

17 0.032290339 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

18 0.032244477 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics

19 0.031980023 32 fast ml-2013-07-05-Processing large files, line by line

20 0.030953938 27 fast ml-2013-05-01-Deep learning made easy


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.174), (1, -0.152), (2, 0.08), (3, -0.635), (4, 0.06), (5, -0.174), (6, 0.096), (7, -0.194), (8, 0.007), (9, -0.147), (10, -0.021), (11, 0.03), (12, -0.074), (13, 0.072), (14, -0.028), (15, 0.066), (16, 0.039), (17, -0.049), (18, -0.066), (19, -0.018), (20, -0.023), (21, 0.006), (22, 0.068), (23, 0.045), (24, -0.027), (25, 0.02), (26, 0.052), (27, 0.032), (28, -0.075), (29, 0.019), (30, 0.073), (31, -0.034), (32, 0.075), (33, -0.071), (34, 0.029), (35, -0.039), (36, -0.062), (37, 0.07), (38, -0.031), (39, -0.023), (40, -0.085), (41, -0.055), (42, -0.059), (43, -0.11), (44, -0.075), (45, 0.091), (46, -0.044), (47, -0.009), (48, -0.004), (49, -0.256)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98595506 4 fast ml-2012-09-17-Best Buy mobile contest


2 0.68141067 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure


3 0.40585858 5 fast ml-2012-09-19-Best Buy mobile contest - big data


4 0.35440114 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision


5 0.21577471 2 fast ml-2012-08-27-Kaggle job recommendation challenge


6 0.130743 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

7 0.12603372 20 fast ml-2013-02-18-Predicting advertised salaries

8 0.1187899 9 fast ml-2012-10-25-So you want to work for Facebook

9 0.11744587 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

10 0.10751939 25 fast ml-2013-04-10-Gender discrimination

11 0.098767318 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

12 0.097729392 19 fast ml-2013-02-07-The secret of the big guys

13 0.097698629 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

14 0.093532592 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

15 0.087113574 40 fast ml-2013-10-06-Pylearn2 in practice

16 0.083876289 27 fast ml-2013-05-01-Deep learning made easy

17 0.078333892 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers

18 0.077298887 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

19 0.073878035 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

20 0.073480763 28 fast ml-2013-05-12-And deliver us from Weka


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(6, 0.015), (26, 0.035), (51, 0.023), (55, 0.021), (69, 0.122), (90, 0.393), (99, 0.277)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.88296753 4 fast ml-2012-09-17-Best Buy mobile contest


2 0.60413492 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure


3 0.59294719 5 fast ml-2012-09-19-Best Buy mobile contest - big data


4 0.46608996 16 fast ml-2013-01-12-Intro to random forests

Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. as the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfit. Gradient boosted trees (notably GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes

5 0.36651936 41 fast ml-2013-10-09-Big data made easy

Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggest that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0 , cloud computing and social media . In effect, big data is a generic term for: data science machine learning data mining predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS : The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or

6 0.36000097 9 fast ml-2012-10-25-So you want to work for Facebook

7 0.3489731 25 fast ml-2013-04-10-Gender discrimination

8 0.33690447 35 fast ml-2013-08-12-Accelerometer Biometric Competition

9 0.33646557 61 fast ml-2014-05-08-Impute missing values with Amelia

10 0.32624018 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

11 0.32531118 2 fast ml-2012-08-27-Kaggle job recommendation challenge

12 0.31847474 43 fast ml-2013-11-02-Maxing out the digits

13 0.31385082 20 fast ml-2013-02-18-Predicting advertised salaries

14 0.30707014 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

15 0.29972738 19 fast ml-2013-02-07-The secret of the big guys

16 0.29954204 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

17 0.29875869 17 fast ml-2013-01-14-Feature selection in practice

18 0.29591259 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

19 0.29210836 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers

20 0.29181987 27 fast ml-2013-05-01-Deep learning made easy