Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two: one based on small data and one on big data. Here we will be talking about the small data contest, specifically about beating the benchmark, but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and the items users clicked after each query, plus some other data. Items in this case are Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked, given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351
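The MAP@5 metric named in the introduction can be sketched as follows. This is only an illustration, assuming exactly one clicked item per query (which is how the task is described here); under that assumption the average precision for a query reduces to 1/rank of the clicked item if it appears among the first five recommendations, and 0 otherwise. The function names are made up for this sketch.

```python
def average_precision_at_k(clicked, recommended, k=5):
    """AP@k for a single relevant item: 1/rank if it is in the top k, else 0.

    Assumes one clicked item per query (an assumption of this sketch).
    """
    for rank, item in enumerate(recommended[:k], start=1):
        if item == clicked:
            return 1.0 / rank
    return 0.0


def map_at_k(clicked_items, recommendations, k=5):
    """Mean of AP@k over all queries; returns 0.0 for an empty input."""
    scores = [
        average_precision_at_k(clicked, recommended, k)
        for clicked, recommended in zip(clicked_items, recommendations)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because a hit in first place scores 1 and a hit in fifth place scores only 0.2, the order of the five recommendations matters, which is why the most-clicked item should come first.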
sentIndex sentText sentNum sentScore
1 There’s a contest on Kaggle called ACM Hackathon. [sent-1, score-0.088]
2 Actually, there are two: one based on small data and one on big data. [sent-2, score-0.08]
3 Here we will be talking about the small data contest, specifically about beating the benchmark, but the ideas are equally applicable to both. [sent-3, score-0.669]
4 The deal is, we have a training set from Best Buy with search queries and the items users clicked after each query, plus some other data. [sent-4, score-1.134]
5 We are asked to predict the item a user clicked, given the query. [sent-6, score-0.606]
6 The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. [sent-8, score-0.246]
7 Concretely ;), we’re gonna build a mapping from queries to items, using the training set. [sent-10, score-0.499]
8 It will be just a Python dictionary looking like this: {'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '3519923': 2}, 'guitarps3': {'2633103': 1}}. Keys in the dictionary are queries, and the values map game IDs to click counts. [sent-11, score-1.316]
9 One thing maybe worth noting here is that we “prepare” the queries: we filter out all non-alphanumeric characters and convert to lower case, so “Rock Smith_1” becomes “rocksmith1”. [sent-12, score-0.154]
10 This way, queries like “Rocksmith 1” and “Rock Smith_1” end up as the same key. [sent-13, score-0.439]
11 This makes sense, because the intent behind them is the same, only the spelling is different. [sent-14, score-0.165]
12 When we are asked to predict an item for a given query, we will check if the query is in our dictionary. [sent-15, score-0.705]
13 If it is, we will recommend up to five IDs found there, starting from the one with the most clicks. [sent-16, score-0.292]
14 When there are fewer than five IDs, we take the rest from the benchmark. [sent-17, score-0.145]
15 It is very easy, because the benchmark always recommends the five most popular items: [9854804, 2107458, 2541184, 2670133, 2173065]. Similarly, if the query is not in our dictionary, we will take all five recommendations from the benchmark. [sent-18, score-0.891]
16 The script takes a few seconds to run and produces a score of 72.8%, while the benchmark is 14.5%. [sent-19, score-0.209]
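The steps described in the sentences above (prepare the query, count clicks per prepared query, then recommend the most-clicked IDs and pad from the benchmark) can be sketched roughly as follows. This is a minimal sketch, not the original script: the function names `prepare`, `build_mapping` and `predict`, the `(query, item_id)` row format, the restriction to ASCII letters and digits, and the choice to skip benchmark IDs that were already recommended are all assumptions made for illustration. IDs are kept as strings, as in the dictionary example.

```python
import re
from collections import Counter, defaultdict

# The five most popular items recommended by the benchmark, as quoted in the post.
BENCHMARK = ['9854804', '2107458', '2541184', '2670133', '2173065']


def prepare(query):
    """Lower-case the query and drop every character that is not a-z or 0-9."""
    return re.sub(r'[^a-z0-9]', '', query.lower())


def build_mapping(rows):
    """Count clicks per prepared query.

    rows is an iterable of (raw_query, item_id) pairs; this input format
    is hypothetical and stands in for the real training file.
    """
    mapping = defaultdict(Counter)
    for raw_query, item_id in rows:
        mapping[prepare(raw_query)][item_id] += 1
    return mapping


def predict(mapping, raw_query, k=5):
    """Return k item IDs: most-clicked first, padded from the benchmark."""
    counts = mapping.get(prepare(raw_query), Counter())
    picks = [item for item, _ in counts.most_common(k)]
    for item in BENCHMARK:
        if len(picks) >= k:
            break
        if item not in picks:  # avoiding repeats is an assumption of this sketch
            picks.append(item)
    return picks
```

With a mapping built this way, a query that is not in the dictionary simply gets the five benchmark IDs, and a query with two known items gets those two followed by the first three benchmark IDs.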
wordName wordTfidf (topN-words)
[('queries', 0.439), ('items', 0.35), ('query', 0.31), ('clicked', 0.211), ('rock', 0.211), ('rocksmith', 0.211), ('ids', 0.193), ('dictionary', 0.193), ('five', 0.145), ('item', 0.14), ('benchmark', 0.139), ('map', 0.129), ('asked', 0.119), ('contest', 0.088), ('applicable', 0.088), ('behind', 0.088), ('keys', 0.088), ('na', 0.088), ('recommends', 0.088), ('require', 0.088), ('traditional', 0.088), ('xbox', 0.088), ('small', 0.08), ('beating', 0.077), ('click', 0.077), ('concretely', 0.077), ('equally', 0.077), ('filter', 0.077), ('noting', 0.077), ('similarly', 0.077), ('spelling', 0.077), ('starting', 0.077), ('predict', 0.076), ('recommend', 0.07), ('typical', 0.07), ('plus', 0.07), ('buy', 0.07), ('seconds', 0.07), ('recommendations', 0.064), ('users', 0.064), ('call', 0.064), ('game', 0.064), ('prepare', 0.064), ('writing', 0.064), ('given', 0.06), ('ideas', 0.06), ('build', 0.06), ('explanation', 0.06), ('sense', 0.06), ('talking', 0.06)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 4 fast ml-2012-09-17-Best Buy mobile contest
2 0.31442678 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits; 2755149 Ace Combat: Assault Horizon; 1208344 Adrenalin Misfits. Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits; 2755149 acecombatassaulthorizon; 1208344 adrenalinmisfits. When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, o
3 0.24673815 5 fast ml-2012-09-19-Best Buy mobile contest - big data
Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there are 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in the “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants managed to beat the benchmark. But don’t be scared. Most of the data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before
4 0.1885727 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
Introduction: Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. They will evaluate the results using mean average precision, or MAP, metric. Specifically MAP@x - this means they ask us to recommend x items for each user. So what is this MAP? First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean for Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision? It may be that we don’t really need to know. But we probably need to know this: we can recommend at most x items for each user; it pays to submit all x recommendations, because we are not penalized for bad guesses; order matters, so it’s better to submit more certain recommendations fi
5 0.11024731 2 fast ml-2012-08-27-Kaggle job recommendation challenge
Introduction: This is an introduction to Kaggle job recommendation challenge . It looks a lot like a typical collaborative filtering thing (with a lot of extra information), but not quite. Spot these two big differences: There are no explicit ratings. Instead, there’s info about which jobs user applied to. This is known as one-class collaborative filtering (OCCF), or learning from positive-only feedback. If you want to dig deeper into the subject, there have been already contests with positive feedback only, for example track two of Yahoo KDD Cup or Millions Songs Dataset Challenge at Kaggle (both about songs). The second difference is less apparent. When you look at test users (that is, the users that we are asked to recommend jobs for), only about half of them made at least one application. For the other half, no data and no collaborative filtering. For the users we have applications data for, it’s very sparse, so we would like to use CF, because it does well in similar se
6 0.05705696 19 fast ml-2013-02-07-The secret of the big guys
7 0.055348571 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
8 0.054075394 25 fast ml-2013-04-10-Gender discrimination
9 0.053018004 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
10 0.050431527 10 fast ml-2012-11-17-The Facebook challenge HOWTO
11 0.042774387 20 fast ml-2013-02-18-Predicting advertised salaries
12 0.040711269 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
13 0.040413532 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
14 0.03707229 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
15 0.035973765 41 fast ml-2013-10-09-Big data made easy
16 0.034236811 9 fast ml-2012-10-25-So you want to work for Facebook
17 0.032290339 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
18 0.032244477 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics
19 0.031980023 32 fast ml-2013-07-05-Processing large files, line by line
20 0.030953938 27 fast ml-2013-05-01-Deep learning made easy
topicId topicWeight
[(0, 0.174), (1, -0.152), (2, 0.08), (3, -0.635), (4, 0.06), (5, -0.174), (6, 0.096), (7, -0.194), (8, 0.007), (9, -0.147), (10, -0.021), (11, 0.03), (12, -0.074), (13, 0.072), (14, -0.028), (15, 0.066), (16, 0.039), (17, -0.049), (18, -0.066), (19, -0.018), (20, -0.023), (21, 0.006), (22, 0.068), (23, 0.045), (24, -0.027), (25, 0.02), (26, 0.052), (27, 0.032), (28, -0.075), (29, 0.019), (30, 0.073), (31, -0.034), (32, 0.075), (33, -0.071), (34, 0.029), (35, -0.039), (36, -0.062), (37, 0.07), (38, -0.031), (39, -0.023), (40, -0.085), (41, -0.055), (42, -0.059), (43, -0.11), (44, -0.075), (45, 0.091), (46, -0.044), (47, -0.009), (48, -0.004), (49, -0.256)]
simIndex simValue blogId blogTitle
same-blog 1 0.98595506 4 fast ml-2012-09-17-Best Buy mobile contest
2 0.68141067 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
3 0.40585858 5 fast ml-2012-09-19-Best Buy mobile contest - big data
4 0.35440114 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
5 0.21577471 2 fast ml-2012-08-27-Kaggle job recommendation challenge
6 0.130743 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
7 0.12603372 20 fast ml-2013-02-18-Predicting advertised salaries
8 0.1187899 9 fast ml-2012-10-25-So you want to work for Facebook
9 0.11744587 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
10 0.10751939 25 fast ml-2013-04-10-Gender discrimination
11 0.098767318 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
12 0.097729392 19 fast ml-2013-02-07-The secret of the big guys
13 0.097698629 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
14 0.093532592 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
15 0.087113574 40 fast ml-2013-10-06-Pylearn2 in practice
16 0.083876289 27 fast ml-2013-05-01-Deep learning made easy
17 0.078333892 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
18 0.077298887 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
19 0.073878035 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
20 0.073480763 28 fast ml-2013-05-12-And deliver us from Weka
topicId topicWeight
[(6, 0.015), (26, 0.035), (51, 0.023), (55, 0.021), (69, 0.122), (90, 0.393), (99, 0.277)]
simIndex simValue blogId blogTitle
same-blog 1 0.88296753 4 fast ml-2012-09-17-Best Buy mobile contest
2 0.60413492 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
3 0.59294719 5 fast ml-2012-09-19-Best Buy mobile contest - big data
4 0.46608996 16 fast ml-2013-01-12-Intro to random forests
Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. as the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfit. Gradient boosted trees (notably GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes
5 0.36651936 41 fast ml-2013-10-09-Big data made easy
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype. Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggest that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or
6 0.36000097 9 fast ml-2012-10-25-So you want to work for Facebook
7 0.3489731 25 fast ml-2013-04-10-Gender discrimination
8 0.33690447 35 fast ml-2013-08-12-Accelerometer Biometric Competition
9 0.33646557 61 fast ml-2014-05-08-Impute missing values with Amelia
10 0.32624018 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
11 0.32531118 2 fast ml-2012-08-27-Kaggle job recommendation challenge
12 0.31847474 43 fast ml-2013-11-02-Maxing out the digits
13 0.31385082 20 fast ml-2013-02-18-Predicting advertised salaries
14 0.30707014 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
15 0.29972738 19 fast ml-2013-02-07-The secret of the big guys
16 0.29954204 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
17 0.29875869 17 fast ml-2013-01-14-Feature selection in practice
18 0.29591259 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
19 0.29210836 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
20 0.29181987 27 fast ml-2013-05-01-Deep learning made easy