fast_ml fast_ml-2012 fast_ml-2012-5 knowledge-graph by maker-knowledge-mining
Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem” because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in the “cloud computing”, “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark. But don’t be scared. Most of the data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before.
sentIndex sentText sentNum sentScore
1 Last time we talked about the small data branch of the Best Buy contest. [sent-1, score-0.438]
2 It is positioned as a “cloud computing sized problem” because there is 7GB of unpacked data, vs. the younger brother’s 20MB. [sent-3, score-0.261]
3 This is reflected in the “cloud computing”, “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark. [sent-5, score-0.734]
4 The really interesting thing is that we can take the script we used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). [sent-10, score-0.389]
5 In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before. [sent-18, score-0.66]
6 Instead, we will read the benchmark file in parallel with the test file, so that when we need benchmark items, we have them handy: for line in reader: popular_skus = bench_reader. [sent-19, score-0.87]
7 The main difference between the two contests, apart from data size, is that here we’re dealing with many product categories, not just Xbox games. [sent-26, score-0.451]
8 The benchmark recommends the most popular products in a given category, not globally. [sent-27, score-0.501]
9 If we build our query -> product mapping taking categories into account, the score will go up dramatically, as promised. [sent-28, score-0.862]
10 This is left as an exercise for the reader, as some of those academic types say. [sent-29, score-0.638]
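A minimal sketch of the two ideas in the sentences above: reading the benchmark file in parallel with the test file, and keying the query -> product mapping by category. This is not the original script. The row layouts, the helper names (`normalize`, `build_mapping`, `predict`) and the sample SKUs are all assumptions made for illustration, and the real contest files may well be laid out differently.

```python
import csv
import io
import re
from collections import Counter, defaultdict


def normalize(query):
    # Lowercase and drop everything except letters and digits,
    # e.g. "Final Fantasy 13" -> "finalfantasy13".
    return re.sub(r"[^a-z0-9]", "", query.lower())


def build_mapping(train_rows):
    # train_rows yields (category, query, sku) tuples (assumed layout).
    # Keying on (category, query) keeps counts from different categories apart.
    mapping = defaultdict(Counter)
    for category, query, sku in train_rows:
        mapping[(category, normalize(query))][sku] += 1
    return mapping


def predict(test_rows, bench_rows, mapping, n=5):
    # zip() advances both readers together, so each test line is paired
    # with the benchmark line at the same position.
    for (category, query), popular_skus in zip(test_rows, bench_rows):
        counts = mapping.get((category, normalize(query)), Counter())
        skus = [sku for sku, _ in counts.most_common(n)]
        # Fill the remaining slots from this line's benchmark items.
        for sku in popular_skus:
            if len(skus) >= n:
                break
            if sku not in skus:
                skus.append(sku)
        yield skus


# Made-up data standing in for the training, test and benchmark files.
train = [
    ("cat1", "Final Fantasy 13", "111"),
    ("cat1", "final fantasy 13", "111"),
    ("cat1", "finalfantasy13", "222"),
    ("cat2", "final fantasy 13", "333"),
]
test_file = io.StringIO("cat1,Final Fantasy 13\ncat2,unknown query\n")
bench_file = io.StringIO("111 900 901 902 903\n800 801 802 803 804\n")

reader = csv.reader(test_file)
bench_reader = csv.reader(bench_file, delimiter=" ")

predictions = list(predict(reader, bench_reader, build_mapping(train)))
for skus in predictions:
    print(" ".join(skus))
```

Pairing the two readers with `zip()` means only one line of each file is held in memory at a time, which suits files of this size. A query seen in training only under a different category falls through to the benchmark items for that line.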
wordName wordTfidf (topN-words)
[('categories', 0.312), ('product', 0.265), ('benchmark', 0.256), ('cloud', 0.229), ('reader', 0.207), ('small', 0.177), ('contest', 0.131), ('computing', 0.131), ('read', 0.131), ('contests', 0.13), ('recommends', 0.13), ('xbox', 0.13), ('academic', 0.13), ('account', 0.13), ('contestants', 0.13), ('fun', 0.13), ('impressed', 0.13), ('obtaining', 0.13), ('sized', 0.13), ('talked', 0.13), ('teams', 0.13), ('xml', 0.13), ('line', 0.123), ('query', 0.115), ('category', 0.115), ('forum', 0.115), ('minutes', 0.115), ('types', 0.115), ('item', 0.104), ('items', 0.104), ('buy', 0.104), ('cluster', 0.104), ('except', 0.104), ('exercise', 0.104), ('extension', 0.104), ('managed', 0.104), ('parallel', 0.104), ('six', 0.104), ('talk', 0.104), ('recommendations', 0.095), ('ids', 0.095), ('size', 0.095), ('together', 0.095), ('build', 0.088), ('really', 0.082), ('usage', 0.082), ('dealing', 0.082), ('differ', 0.082), ('left', 0.082), ('taking', 0.082)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 5 fast ml-2012-09-19-Best Buy mobile contest - big data
2 0.24673815 4 fast ml-2012-09-17-Best Buy mobile contest
Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two, one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and items which users clicked after the query, plus some other data. Items are in this case Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict an item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351
3 0.1658964 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits 2755149 Ace Combat: Assault Horizon 1208344 Adrenalin Misfits Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits 2755149 acecombatassaulthorizon 1208344 adrenalinmisfits When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire ( lanoire ), but not for L.A. Noir ( lanoir ). It is easy to see that this is basically the same query, o
4 0.14601281 32 fast ml-2013-07-05-Processing large files, line by line
Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, libsvm format is popular. Both can be processed using csv module in Python. import csv i_f = open( input_file, 'r' ) reader = csv.reader( i_f ) For libsvm you just set the delimiter to space: reader = csv.reader( i_f, delimiter = ' ' ) Then you go over the file contents. Each line is a list of strings: for line in reader: # do something with the line, for example: label = float( line[0] ) # .... writer.writerow( line ) If you need to do a second pass, you just rewind the input file: i_f.seek( 0 ) for line in re
5 0.087565042 41 fast ml-2013-10-09-Big data made easy
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype. Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or
6 0.077085175 2 fast ml-2012-08-27-Kaggle job recommendation challenge
7 0.066231996 19 fast ml-2013-02-07-The secret of the big guys
8 0.06304495 33 fast ml-2013-07-09-Introducing phraug
9 0.062868327 20 fast ml-2013-02-18-Predicting advertised salaries
10 0.062484141 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
11 0.061559644 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
12 0.061138812 25 fast ml-2013-04-10-Gender discrimination
13 0.05772128 8 fast ml-2012-10-15-Merck challenge
14 0.052094795 35 fast ml-2013-08-12-Accelerometer Biometric Competition
15 0.051549453 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
16 0.051087283 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
17 0.044421598 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
18 0.042426258 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
19 0.04221978 43 fast ml-2013-11-02-Maxing out the digits
20 0.04065479 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
topicId topicWeight
[(0, 0.2), (1, -0.149), (2, 0.091), (3, -0.475), (4, 0.032), (5, -0.075), (6, 0.023), (7, 0.004), (8, 0.0), (9, -0.314), (10, -0.123), (11, 0.057), (12, -0.079), (13, -0.065), (14, 0.065), (15, 0.081), (16, -0.007), (17, -0.133), (18, -0.014), (19, 0.067), (20, 0.095), (21, -0.015), (22, -0.109), (23, -0.052), (24, -0.013), (25, 0.006), (26, -0.233), (27, -0.089), (28, -0.023), (29, -0.067), (30, -0.129), (31, -0.063), (32, -0.042), (33, -0.063), (34, 0.054), (35, -0.037), (36, -0.118), (37, 0.006), (38, -0.08), (39, 0.142), (40, -0.022), (41, 0.184), (42, 0.007), (43, 0.022), (44, -0.069), (45, 0.046), (46, -0.035), (47, 0.268), (48, 0.038), (49, 0.402)]
simIndex simValue blogId blogTitle
same-blog 1 0.98610312 5 fast ml-2012-09-19-Best Buy mobile contest - big data
2 0.46469393 4 fast ml-2012-09-17-Best Buy mobile contest
3 0.35166231 32 fast ml-2013-07-05-Processing large files, line by line
4 0.26047567 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
5 0.17377505 41 fast ml-2013-10-09-Big data made easy
6 0.14814278 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
7 0.1438553 19 fast ml-2013-02-07-The secret of the big guys
8 0.13789253 25 fast ml-2013-04-10-Gender discrimination
9 0.13697758 35 fast ml-2013-08-12-Accelerometer Biometric Competition
10 0.13006817 2 fast ml-2012-08-27-Kaggle job recommendation challenge
11 0.12388714 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
12 0.12342736 8 fast ml-2012-10-15-Merck challenge
13 0.11223497 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
14 0.10847709 34 fast ml-2013-07-14-Running things on a GPU
15 0.10624247 26 fast ml-2013-04-17-Regression as classification
16 0.10032046 61 fast ml-2014-05-08-Impute missing values with Amelia
17 0.09777011 43 fast ml-2013-11-02-Maxing out the digits
18 0.09673395 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit
19 0.096598968 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
20 0.096200071 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
topicId topicWeight
[(69, 0.069), (99, 0.822)]
simIndex simValue blogId blogTitle
same-blog 1 0.99308354 5 fast ml-2012-09-19-Best Buy mobile contest - big data
2 0.96301633 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
3 0.74999994 4 fast ml-2012-09-17-Best Buy mobile contest
4 0.6003477 16 fast ml-2013-01-12-Intro to random forests
Introduction: Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. as the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfit. Gradient boosted trees (notably GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectivnes
5 0.48386353 41 fast ml-2013-10-09-Big data made easy
6 0.3580094 9 fast ml-2012-10-25-So you want to work for Facebook
7 0.35307398 2 fast ml-2012-08-27-Kaggle job recommendation challenge
8 0.34736514 25 fast ml-2013-04-10-Gender discrimination
9 0.33679882 61 fast ml-2014-05-08-Impute missing values with Amelia
10 0.30314511 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
11 0.29840973 35 fast ml-2013-08-12-Accelerometer Biometric Competition
12 0.29576695 28 fast ml-2013-05-12-And deliver us from Weka
13 0.28969091 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
14 0.2812798 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
15 0.26450416 37 fast ml-2013-09-03-Our followers and who else they follow
16 0.26058537 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
17 0.25797498 19 fast ml-2013-02-07-The secret of the big guys
18 0.25120968 43 fast ml-2013-11-02-Maxing out the digits
19 0.25084874 34 fast ml-2013-07-14-Running things on a GPU
20 0.24775872 20 fast ml-2013-02-18-Predicting advertised salaries