fast_ml fast_ml-2012 knowledge-graph by maker-knowledge-mining

fast_ml 2012 knowledge graph



blogs list:

1 fast ml-2012-12-27-Spearmint with a random forest

Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: the number of trees (ntrees) and the number of candidate features at each split (mtry). Here’s some code. We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression task, as ratings are in the (0,10) range. We will split the data 80/10/10 into train, validation and test sets, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE. At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there are eleven features in total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green: RMSE < 0.57; blue: RMSE < 0.58; black: RMSE >= 0.58. Turns out that some diffe
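
The 80/10/10 split and the RMSE measure mentioned above can be sketched in a few lines of Python. This is only an illustration, not the code the post links to: the function names, the shuffling and the fixed seed are our assumptions.

```python
import math
import random


def split_80_10_10(rows, seed=0):
    """Shuffle rows and split them into train, validation and test parts.

    The seed and the shuffling step are assumptions made for this sketch.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding
    n_valid = n // 10
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever remains, about 10%
    return train, valid, test


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

With roughly 1600 examples this gives about 1280 training rows and 160 rows each for validation and testing. The validation RMSE is the number a tuner such as Spearmint would be asked to minimize.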

2 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human): data preparation and model tuning. This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into a format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be the number of trees to grow and the number of candidate features at each split (mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

3 fast ml-2012-12-07-Predicting wine quality

Introduction: This post is as much about wine as it is about machine learning, so if you enjoy wine, like we do, you may find it especially interesting. Here’s some R and Matlab code, and if you want to get right to the point, skip to the charts. There’s a book by Philipp Janert called Data Analysis with Open Source Tools, which, by the way, we would recommend. From this book we found out about the wine quality datasets. There are two, one for red wine and one for white wine, and they are interesting because they contain quality ratings (1 - 10) for a few thousand wines, along with their physical and chemical properties. We could probably use these properties to predict a rating for a wine. We’ll be looking at white and red wine separately for the reasons you will see shortly. Principal component analysis for white wine Janert performs a principal component analysis (PCA) and shows a resulting plot for white wine. What’s interesting about this plot is that judging by the first tw

4 fast ml-2012-11-17-The Facebook challenge HOWTO

Introduction: Last time we wrote about the Facebook challenge. Now it’s time for some more details. The main concept is this: in its original state, the data is useless. That’s because there are many names referring to the same entity. Precisely, there are about 350k unique names, and the total number of entities is maybe 20k. So cleaning the data is the first and most important step. Data cleaning. These are the things we do, in order: extract all the names, including numbers, from the graph files and the paths file; compute a fingerprint for each name. We used fingerprints similar to ones in Google Refine. Ours not only had sorted tokens (words) in a name, but also sorted letters in a token. Names with the same fingerprint are very likely the same. Remaining distortions are mostly in the form of word combinations: a name, a longer name, a still longer name. This can be dealt with by checking if a given name is a subset of any other name. If it is a subset of only one other
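
A fingerprint of the kind described above (sorted tokens, with the letters inside each token sorted too) might look like the following Python sketch. No code was published for this contest, so this is an illustration only: lowercasing and dropping punctuation are our assumptions, borrowed from how Google Refine style fingerprints usually work, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def fingerprint(name):
    """Return a key that ignores word order and letter order within words.

    Lowercasing and stripping punctuation are assumptions of this sketch.
    """
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    sorted_tokens = sorted("".join(sorted(token)) for token in tokens)
    return " ".join(sorted_tokens)


def group_by_fingerprint(names):
    """Group names that share a fingerprint; each group is likely one entity."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return groups
```

For example, "New York" and "Yrok New" both map to the key "enw kory", so they land in the same group. Sorting letters is aggressive: it also merges genuine anagrams, which is the price paid for tolerating swapped characters.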

5 fast ml-2012-10-25-So you want to work for Facebook

Introduction: Good news, everyone! There’s a new contest on Kaggle - Facebook is looking for talent. They won’t pay, but just might interview. This post is in a way a bonus for active readers because most visitors of fastml.com originally come from Kaggle forums. For this competition the forums are disabled to encourage own work. To honor this, we won’t publish any code. But own work doesn’t mean original work, and we wouldn’t want to reinvent the wheel, would we? The contest differs substantially from a Kaggle stereotype, if there is such a thing, in three major ways: there are no money prizes, as mentioned above; it’s not a real world problem, but rather an assignment to screen job candidates (this has important consequences, described below); it’s not a typical machine learning project, but rather a broader AI exercise. You are given a graph of the internet, actually a snapshot of the graph for each of 15 time steps. You are also given a bunch of paths in this graph, which a

6 fast ml-2012-10-15-Merck challenge

Introduction: Today it’s about the Merck challenge - let’s beat the benchmark real quick. Not by much, but quick. If you look at the source code, you’ll notice this:

    # data sets 1 and 6 are too large to fit into memory and run basic
    # random forest. Sample 20% of data set instead.
    if (i== 1 | i==6) {
      Nrows = length(train[,1])
      train <- train[sample(Nrows, as.integer(0.2*Nrows)),]
    }

It means that only 20% of sets one and six is used. This suggests an angle of attack, because more data beats a cleverer algorithm [1]. Let’s try our pal VW. We’ll just convert those sets to Vowpal Wabbit format, run training, run prediction, and convert results to Kaggle format. OK, 0.39, we’re done for the evening. Earlier we tried training a random forest implementation which could take the whole set into memory, but it took maybe an hour to run anyway, and the result of that first attempt wasn’t so good. We’re not into this kind of tempo so we explored other possibilities. [1] Pedro Do

7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.
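
The by-hand approach mentioned above (one binary dataset per class, then combining predictions) can be sketched as follows. This is an illustration under our own assumptions, not the code we tried: the +1/-1 label encoding, the "highest score wins" rule and the class names in the usage note are all hypothetical.

```python
def one_vs_all_labels(labels, positive_class):
    """Turn multiclass labels into binary ones: +1 for one class, -1 for the rest."""
    return [1 if label == positive_class else -1 for label in labels]


def combine_predictions(scores_by_class):
    """Pick, for each example, the class whose binary model scored highest.

    scores_by_class maps a class name to a list of scores, one per example.
    """
    classes = list(scores_by_class)
    n_examples = len(scores_by_class[classes[0]])
    if any(len(scores) != n_examples for scores in scores_by_class.values()):
        raise ValueError("every class needs one score per example")
    return [
        max(classes, key=lambda c: scores_by_class[c][i])
        for i in range(n_examples)
    ]
```

With hypothetical classes "open" and "closed" and scores {"open": [0.9, 0.2], "closed": [0.1, 0.7]}, the combined prediction is "open" for the first example and "closed" for the second. The tricky part in practice is that scores from separately trained binary models are not necessarily comparable, which is one reason a tool with built-in multiclass support is preferable.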

8 fast ml-2012-09-25-Best Buy mobile contest - full disclosure

Introduction: Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction. As far as product names go, we will use product data available in XML format to extract SKU and name for each product: 8564564 Ace Combat 6: Fires of Liberation Platinum Hits 2755149 Ace Combat: Assault Horizon 1208344 Adrenalin Misfits Further, we will process the names in the same way we processed queries: 8564564 acecombat6firesofliberationplatinumhits 2755149 acecombatassaulthorizon 1208344 adrenalinmisfits When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark. But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire ( lanoire ), but not for L.A. Noir ( lanoir ). It is easy to see that this is basically the same query, o
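
The name processing and the spelling correction described above could be sketched like this. It is not the final script from the post: the normalization rule is inferred from the examples shown (lowercase, letters and digits only), and difflib.get_close_matches from the Python standard library stands in for whatever spelling correction the script actually uses. The function names and the cutoff value are our assumptions.

```python
import difflib
import re


def normalize(text):
    """Lowercase and keep only letters and digits, as in the examples above."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def correct_query(query, known_queries, cutoff=0.85):
    """Map a query to the closest known normalized query, or None.

    known_queries is a list of already normalized strings. The cutoff is
    an arbitrary value chosen for this sketch.
    """
    key = normalize(query)
    if key in known_queries:
        return key
    matches = difflib.get_close_matches(key, known_queries, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Here "L.A. Noir" normalizes to lanoir, which is close enough to the known lanoire to be corrected to it, while a query that resembles nothing known returns None and falls through to the next source of predictions.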

9 fast ml-2012-09-19-Best Buy mobile contest - big data

Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. younger brother’s 20MB. This is reflected in “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants managed to beat the benchmark. But don’t be scared. Most of the data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before

10 fast ml-2012-09-17-Best Buy mobile contest

Introduction: There’s a contest on Kaggle called ACM Hackathon. Actually, there are two, one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both. The deal is, we have a training set from Best Buy with search queries and items which users clicked after the query, plus some other data. Items are in this case Xbox games like “Batman” or “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP). The problem isn’t typical for Kaggle, because it doesn’t really require using machine learning in the traditional sense. To beat the benchmark, it’s enough to write a short script. Concretely ;), we’re gonna build a mapping from queries to items, using the training set. It will be just a Python dictionary looking like this: 'forzasteeringwheel': {'2078113': 1}, 'finalfantasy13': {'9461183': 3, '351
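
Building such a query-to-items dictionary and reading predictions from it might look like the sketch below. It is a minimal illustration, not the script from the post: the normalization rule, the function names and the fallback to a fixed benchmark list are assumptions, and the SKUs in the usage note are made up.

```python
import re
from collections import Counter, defaultdict


def normalize(query):
    """Lowercase and keep only letters and digits (an assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", query.lower())


def build_mapping(pairs):
    """Count clicked items per normalized query.

    pairs is an iterable of (query, sku) tuples taken from the training set.
    """
    mapping = defaultdict(Counter)
    for query, sku in pairs:
        mapping[normalize(query)][sku] += 1
    return mapping


def predict(query, mapping, benchmark, k=5):
    """Return up to k items: most clicked first, then benchmark items as filler."""
    counts = mapping.get(normalize(query), Counter())
    ranked = [sku for sku, _ in counts.most_common(k)]
    for sku in benchmark:
        if len(ranked) >= k:
            break
        if sku not in ranked:
            ranked.append(sku)
    return ranked
```

If a query was followed twice by a click on item A and once by a click on item B in training, predict returns A, then B, then benchmark items until there are five. Ordering by click count matters because MAP@5 rewards putting the most likely item first.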

11 fast ml-2012-09-01-Running Unix apps on Windows

Introduction: When it comes to machine learning, most software seems to be in either Python, Matlab or R. Plus native apps, that is, compiled C/C++. These are the fastest. Most of them are written for Unix environments, for example Linux or MacOS. So how do you run them on your computer if you have Windows installed? Back in the day, you re-partitioned your hard drive and installed Linux alongside Windows. The added thrill was, if something went wrong, your computer wouldn’t boot. Now it’s easier. You just run Linux inside Windows, using what’s called a virtual machine. You need virtualization software and a machine image to do so. The most popular software seems to be VMware. There is also VirtualBox - it is able to run VMware images. We have experience with VMware mostly, so this is what we’ll refer to. VMware player is free to download and use. There are also many images available, of various flavours of Linux and other operating systems. In fact, you can run Windows inside Linux if you wish

12 fast ml-2012-08-27-Kaggle job recommendation challenge

Introduction: This is an introduction to the Kaggle job recommendation challenge. It looks a lot like a typical collaborative filtering thing (with a lot of extra information), but not quite. Spot these two big differences: There are no explicit ratings. Instead, there’s info about which jobs a user applied to. This is known as one-class collaborative filtering (OCCF), or learning from positive-only feedback. If you want to dig deeper into the subject, there have already been contests with positive feedback only, for example track two of Yahoo KDD Cup or Million Song Dataset Challenge at Kaggle (both about songs). The second difference is less apparent. When you look at test users (that is, the users that we are asked to recommend jobs for), only about half of them made at least one application. For the other half, no data and no collaborative filtering. For the users we have applications data for, it’s very sparse, so we would like to use CF, because it does well in similar se

13 fast ml-2012-08-09-What you wanted to know about Mean Average Precision

Introduction: Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. They will evaluate the results using the mean average precision, or MAP, metric. Specifically MAP@x - this means they ask us to recommend x items for each user. So what is this MAP? First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean of Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision? It may be that we don’t really need to know. But we probably need to know this: we can recommend at most x items for each user; it pays to submit all x recommendations, because we are not penalized for bad guesses; order matters, so it’s better to submit more certain recommendations fi
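
For the curious, here is a small Python sketch of MAP@x. The mean over users follows the description above. The AP formula itself is an assumption of this sketch: it is the variant commonly used for Kaggle-style evaluation (precision at each hit, divided by the smaller of x and the number of relevant items), and a given contest may define it slightly differently. The function names are ours.

```python
def average_precision_at_k(actual, predicted, k):
    """AP@k for one user: actual is the relevant items, predicted is a ranked list."""
    relevant = set(actual)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        if item in relevant and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)  # repeated recommendations earn nothing
    return score / min(len(relevant), k)


def mean_average_precision_at_k(actual_lists, predicted_lists, k):
    """MAP@k: the plain mean of AP@k over all users."""
    if len(actual_lists) != len(predicted_lists) or not actual_lists:
        raise ValueError("need one prediction list per user")
    aps = [
        average_precision_at_k(actual, predicted, k)
        for actual, predicted in zip(actual_lists, predicted_lists)
    ]
    return sum(aps) / len(aps)
```

Two properties from the text are visible here. Appending a wrong guess after a correct one does not lower the score, so it pays to fill all x slots. And a correct item in second place earns only half of what it earns in first place, so order matters.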