fast_ml 2013 knowledge graph by maker-knowledge-mining




blogs list:

1 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

Introduction: We continue with the CIFAR-10-based competition at Kaggle to get to know DropConnect. It’s supposed to be an improvement over dropout. And dropout is certainly one of the bigger steps forward in neural network development. Is DropConnect really better than dropout? TL;DR: DropConnect seems to offer results similar to dropout. The state-of-the-art scores reported in the paper come from model ensembling. Dropout. Dropout, by Hinton et al., is perhaps the biggest invention in the field of neural networks in recent years. It addresses the main problem in machine learning, that is, overfitting. It does so by “dropping out” some unit activations in a given layer, that is, setting them to zero. Thus it prevents co-adaptation of units and can also be seen as a method of ensembling many networks sharing the same weights. For each training example a different set of units to drop is randomly chosen. The idea has a biological inspiration: when a child is conceived, it receives half its genes…
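To make the mechanics concrete, here’s a minimal sketch of dropout at training time in numpy - our illustration, not code from the paper. It uses “inverted” dropout, scaling the surviving activations by 1/(1-p) so that nothing needs rescaling at test time (the original paper instead scales the weights when testing):

import numpy as np

def dropout( activations, p = 0.5 ):
    # zero out each unit independently with probability p,
    # scale survivors so the expected activation stays the same
    mask = np.random.rand( *activations.shape ) > p
    return activations * mask / ( 1.0 - p )

h = np.random.randn( 100 )        # a hidden layer's activations
h_train = dropout( h, p = 0.5 )   # a fresh random mask for each example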

2 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics

Introduction: A/B testing is a way to optimize a web page. Half of visitors see one version, the other half another, so you can tell which version is more conducive to your goal - for example selling something. Since June 2013 A/B testing can be conveniently done with Google Analytics. Here’s how. This article is not quite about machine learning. If you’re not interested in testing, scroll down to the bayesian bandits section. Google Content Experiments. We remember Google Website Optimizer from a few years ago. It wasn’t exactly user-friendly or slick, but it felt solid and did the job. Unfortunately, at one point Google pulled the plug, leaving Genetify as the sole free (and open source) tool for multivariate testing. Multivariate means testing a few elements on a page simultaneously. At that time they launched Content Experiments in Google Analytics, but it was a giant step backward. Content Experiments were very primitive and only allowed rudimentary A/B split testing…
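The bayesian bandits mentioned above come down to Thompson sampling: keep a Beta posterior over each variation’s conversion rate, draw one sample per variation, and show the variation with the highest draw. A minimal sketch of the idea (our own, not Google’s implementation):

import numpy as np

conversions = np.array( [ 12, 20 ] )   # successes per variation
misses      = np.array( [ 88, 80 ] )   # failures per variation

def choose_variation():
    # sample a plausible conversion rate from each Beta posterior;
    # uncertain variations still get explored occasionally
    samples = np.random.beta( conversions + 1, misses + 1 )
    return np.argmax( samples )

arm = choose_variation()
# show variation `arm` to the visitor, then update conversions/misses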

3 fast ml-2013-12-07-13 NIPS papers that caught our eye

Introduction: Recently Rob Zinkov published his selection of interesting-looking NIPS papers. Inspired by this, we list some more. Rob seems to like Bayesian stuff; we’re more into neural networks. If you feel like browsing, Andrej Karpathy has a page with all NIPS 2013 papers. They are categorized by topics discovered by running LDA. When you see an interesting paper, you can discover ones ranked as similar by TF-IDF. Here’s what we found. Understanding Dropout (Pierre Baldi, Peter J. Sadowski): Dropout is a relatively new algorithm for training neural networks which relies on stochastically dropping out neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are…

4 fast ml-2013-11-27-Object recognition in images with cuda-convnet

Introduction: Object recognition in images is where deep learning, and specifically convolutional neural networks, are often applied and benchmarked these days. To get a piece of the action, we’ll be using Alex Krizhevsky’s cuda-convnet, a shining diamond of machine learning software, in a Kaggle competition. Continuing to run things on a GPU, we turn to applying convolutional neural networks to object recognition. This kind of network was developed by Yann LeCun and it’s powerful, but a bit complicated. (Image credit: EBLearn tutorial.) A typical convolutional network has two parts. The first is responsible for feature extraction and consists of one or more pairs of convolution and subsampling/max-pooling layers, as you can see above. The second part is just a classic fully-connected multilayer perceptron taking the extracted features as input. For a detailed explanation of all this, see unit 9 in Hugo Larochelle’s neural networks course. Daniel Nouri has an interesting story about…
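As a toy illustration of one convolution/max-pooling pair described above - a sketch of the math, nothing to do with cuda-convnet internals:

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand( 28, 28 )    # a grayscale input image
kernel = np.random.randn( 5, 5 )    # one (learned) filter

# convolution: slide the filter over the image to get a feature map
feature_map = convolve2d( image, kernel, mode = 'valid' )    # 24 x 24

# 2x2 max-pooling: keep the strongest response in each block
h, w = feature_map.shape
pooled = feature_map.reshape( h // 2, 2, w // 2, 2 ).max( axis = ( 1, 3 ) )    # 12 x 12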

5 fast ml-2013-11-18-CUDA on a Linux laptop

Introduction: After testing CUDA on a desktop, we now switch to a Linux laptop with 64-bit Xubuntu. Getting CUDA to work is harder here. Will the effort be worth the results? If you have a laptop with an Nvidia card, the thing probably uses it for 3D graphics and Intel’s built-in unit for everything else. This technology is known as Optimus and it happens to make things anything but easy for running CUDA on Linux. The problem is with GPU drivers, specifically the tension between Linux being open source and Nvidia drivers being not. This strained relation at one time prompted Linus Torvalds to give Nvidia the finger with great passion. Installing GPU drivers. Here’s a solution for the driver problem. You need a package called bumblebee. It makes an Nvidia card accessible to your apps. To install the drivers and bumblebee, try something along these lines:

sudo apt-get install nvidia-current-updates
sudo apt-get install bumblebee

Note that you don’t need the drivers that come with a specific CUDA release…

6 fast ml-2013-11-02-Maxing out the digits

Introduction: Recently we’ve been investigating the basics of Pylearn2. Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. Maxout explained. If you’ve been following developments in deep learning, you know that Hinton’s most recent recommendation for supervised learning, after a few years of bashing backpropagation in favour of unsupervised pretraining, is to use classic multilayer perceptrons with dropout and rectified linear units. For us, this breath of simplicity is a welcome change. Rectified linear is f(x) = max( 0, x ). This makes backpropagation trivial: for x > 0 the derivative is one, else zero. Note that ReLU consists of two linear functions. But why stop at two? Let’s take the max of three, or four, or five linear functions… And so maxout is a generalization of ReLU. It can approximate any convex function. Now backpropagation is easy and dropout prevents overfitting, so we can train a deep…
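In code the difference is tiny. A numpy sketch of a maxout unit next to ReLU (our illustration of the formulas, not Pylearn2’s implementation):

import numpy as np

def relu( x ):
    # the max of two linear functions: 0 and x
    return np.maximum( 0, x )

def maxout( x, W, b ):
    # the max of k linear functions of x; W holds one weight vector
    # per linear piece, b one bias per piece
    return np.max( np.dot( W, x ) + b, axis = 0 )

x = np.random.randn( 10 )
W = np.random.randn( 3, 10 )    # k = 3 linear pieces
b = np.random.randn( 3 )
print( maxout( x, W, b ) )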

7 fast ml-2013-10-28-How much data is enough?

Introduction: A Reddit reader asked how much data is needed for a machine learning project to get meaningful results. Prof. Yaser Abu-Mostafa from Caltech answered this very question in his online course. The answer is that, as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model. In the case of a linear model, degrees of freedom essentially equal data dimensionality (the number of columns), so a linear model on 100 columns calls for roughly 1000 examples. We find that thinking in terms of dimensionality vs. the number of examples is a convenient shortcut. The more powerful the model, the more prone it is to overfitting, and so the more examples you need. And of course the way to control this is through validation. Breaking the rules. In practice you can get away with less than 10x, especially if your model is simple and uses regularization. In Kaggle competitions the ratio is often closer to 1:1, and sometimes dimensionality is far greater than the number of examples, depending on how you pre-process the data…

8 fast ml-2013-10-09-Big data made easy

Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype. Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: “The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or…”

9 fast ml-2013-10-06-Pylearn2 in practice

Introduction: What do you get when you mix one part brilliant and one part daft? You get Pylearn2, a cutting-edge neural networks library from Montreal that’s rather hard to use. Here we’ll show how to get through the daft part with your mental health relatively intact. Pylearn2 comes from the LISA lab in Montreal, led by Yoshua Bengio. Those are pretty smart guys and they concern themselves with deep learning. Recently they published a paper entitled Pylearn2: a machine learning research library [arxiv]. Here’s a quote: “Pylearn2 is a machine learning research library - its users are researchers. This means (…) it is acceptable to assume that the user has some technical sophistication and knowledge of machine learning.” The word research is possibly the most common word in the paper. There’s a reason for that: the library is certainly not production-ready. OK, it’s not that bad. There are only two difficult things: getting your data in, and getting predictions out…
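For flavour, here’s roughly what “getting your data in” looks like: wrap a design matrix in a DenseDesignMatrix and pickle it for the YAML config to load. This is a sketch from memory of the 2013-era API, so treat the exact names as assumptions:

import numpy as np
from pylearn2.datasets.dense_design_matrix import DenseDesignMatrix
from pylearn2.utils import serial

X = np.random.rand( 1000, 20 )                # examples x features
y = np.random.randint( 0, 2, ( 1000, 1 ) )    # labels as a column vector

dataset = DenseDesignMatrix( X = X, y = y )
serial.save( 'train.pkl', dataset )           # referenced from the YAML config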

10 fast ml-2013-09-19-What you wanted to know about AUC

Introduction: AUC, or Area Under Curve, is a metric for binary classification. It’s probably the second most popular one, after accuracy. Unfortunately, it’s nowhere near as intuitive. That is, until you have read this article. Accuracy deals with ones and zeros, meaning you either got the class label right or you didn’t. But many classifiers are able to quantify their uncertainty about the answer by outputting a probability value. To compute accuracy from probabilities you need a threshold to decide when zero turns into one. The most natural threshold is of course 0.5. Let’s suppose you have a quirky classifier. It is able to get all the answers right, but it outputs 0.7 for negative examples and 0.9 for positive examples. Clearly, a threshold of 0.5 won’t get you far here. But 0.8 would be just perfect. That’s the whole point of using AUC - it considers all possible thresholds. Various thresholds result in different true positive/false positive rates. As you decrease the threshold, you get…
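The quirky classifier makes a nice sanity check in code. With scikit-learn:

from sklearn.metrics import roc_auc_score

y_true   = [ 0,   0,   1,   1   ]
y_scores = [ 0.7, 0.7, 0.9, 0.9 ]    # accuracy at threshold 0.5 is only 50%...

print( roc_auc_score( y_true, y_scores ) )    # ...yet AUC = 1.0

AUC comes out as 1.0 because some threshold (say 0.8) separates the classes perfectly; the metric only cares about the ordering of the scores.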

11 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

Introduction: Kaggle again. This time, solar energy prediction. We will show how to get data out of NetCDF4 files in Python and then beat the benchmark. The goal of this competition is to predict solar energy at Oklahoma Mesonet stations (red dots) from weather forecasts for GEFS points (blue dots). We’re getting a number of NetCDF files, each holding info about one variable, like expected temperature or precipitation. These variables have a few dimensions: time (the training set contains information about 5113 consecutive days, with five forecasts for different hours on each day), location (latitude and longitude of the GEFS points), and weather model (there are 11 forecasting models, called ensembles). NetCDF4 tutorial. The data is in NetCDF files, a binary format apparently popular for storing scientific data. We will access it from Python using netcdf4-python. To use it, you will need to install the HDF5 and NetCDF4 libraries first. If you’re on Windows…
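Opening one of these files takes a few lines with netcdf4-python. A minimal sketch - the file and variable names below are placeholders, so inspect .variables to see the real ones:

from netCDF4 import Dataset

d = Dataset( 'some_gefs_variable.nc' )    # hypothetical file name
print( d.variables.keys() )               # names of the stored variables

# a variable acts like a numpy array with named dimensions,
# here something like ( days, ensembles, hours, lat, lon )
v = d.variables[ 'Total_precipitation' ]  # hypothetical variable name
print( v.shape )
first_day = v[ 0 ]    # slicing reads only that chunk from disk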

12 fast ml-2013-09-03-Our followers and who else they follow

Introduction: Recently we hit the 400-follower mark on Twitter. To celebrate, we decided to do some data mining on you, specifically to discover who our followers are and who else they follow. For your viewing pleasure we packaged the results nicely with Bootstrap. Here’s some data science in action. Our followers. This table shows our 20 most popular followers, as measured by their follower count. The occasional question marks stand for non-ASCII characters. Each link opens a new window.

Followers | Screen name | Name | Description
8685 | pankaj | Pankaj Gupta | I lead the Personalization and Recommender Systems group at Twitter. Founded two startups in the past.
5070 | ogrisel | Olivier Grisel | Datageek, contributor to scikit-learn, works with Python / Java / Clojure / Pig, interested in Machine Learning, NLProc, {Big|Linked|Open} Data and braaains!
4582 | thuske | thuske | &
4442 | ram | Ram Ravichandran | So…

13 fast ml-2013-08-23-A bag of words and a nice little network

Introduction: In this installment we will demonstrate how to turn text into numbers by a method known as a bag of words. We will also show how to train a simple neural network on the resulting sparse data for binary classification. We will achieve the first feat with Python and scikit-learn, the second one with sparsenn. The example data comes from a Kaggle competition, specifically StumbleUpon Evergreen. The subject of the contest is to classify webpage content as either evergreen or not. The train set consists of about 7k examples. For each example we have a title and a body for a webpage, and then about 20 numeric features describing the content (usually, because some tidbits are missing). We will use text only: extract the body and turn it into a bag of words in libsvm format. In case you don’t know, the libsvm format is pretty popular for storing sparse data as text. It looks like this:

1 94:1 298:1 474:1 492:1 1213:1 1536:1 (...)

First goes the label and then the indexes of non-zero features…
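A hedged sketch of the first feat with scikit-learn: turn raw text into a sparse bag of words and dump it in libsvm format. Note that dump_svmlight_file writes 0-based feature indexes by default:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import dump_svmlight_file

docs = [ 'an evergreen page about cooking', 'news from last tuesday' ]
labels = [ 1, 0 ]

vectorizer = CountVectorizer( binary = True )    # 1 if a word occurs, 0 otherwise
X = vectorizer.fit_transform( docs )             # sparse matrix: docs x vocabulary

dump_svmlight_file( X, labels, 'data.libsvm' )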

14 fast ml-2013-08-12-Accelerometer Biometric Competition

Introduction: Can you recognize users of mobile devices from accelerometer data? It’s a rather non-standard problem (we’re dealing with time series here) and an interesting one. So we wrote some code and ran EC2, then wrote some more and ran EC2 again. After much computation we had our predictions, submitted them and achieved AUC = 0.83. Yay! But there’s a twist to this story. An interesting thing happened when we mistyped a command. Instead of computing AUC from a validation set and raw predictions, we computed it using the training set:

auc.py train_v.vw r.txt

And got 0.86. Now, the training file has a very different number of lines from the test file, but our script only cares about how many lines are in the predictions file, so as long as the other file has at least that many, it’s OK. It turns out you can use this set or that set and get pretty much the same good result with either. That got us thinking. We computed the mean of predictions for each device. Here’s the plot…
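The “mean of predictions for each device” step is a one-liner with pandas. A sketch with hypothetical column names:

import pandas as pd

p = pd.read_csv( 'predictions.csv' )    # columns assumed: device, prediction
device_means = p.groupby( 'device' )[ 'prediction' ].mean()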

15 fast ml-2013-07-14-Running things on a GPU

Introduction: You’ve heard about running things on a graphics card, but have you tried it? All you need to taste the speed is an Nvidia card and some software. We run experiments using Cudamat and Theano in Python. GPUs differ from CPUs in that they are optimized for throughput instead of latency. Here’s a metaphor: when you play an online game, you want fast response times, or low latency; otherwise you get lag. However, when you’re downloading a movie, you don’t care about response times, you care about bandwidth - that’s throughput. Massive computations are similar to downloading a movie in this respect. The setup. We’ll be testing things on a platform with an Intel Dual Core CPU @ 3 GHz and either a GeForce 9600 GT, an old card, or a GeForce GTX 550 Ti, a more recent card. See the appendix for more info on GPU hardware. The software we’ll be using is Python. On the CPU it employs one core*. That is OK in everyday use, because while one core is working hard, you can comfortably do something else…
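For a taste of Cudamat, here’s a minimal sketch: allocate matrices on the GPU, multiply, copy the result back. We believe the calls below match the cudamat API of the time, but treat them as assumptions:

import numpy as np
import cudamat as cm

cm.cublas_init()    # initialize the GPU

a = cm.CUDAMatrix( np.random.randn( 2048, 2048 ) )
b = cm.CUDAMatrix( np.random.randn( 2048, 2048 ) )

c = cm.dot( a, b )      # the multiplication runs on the GPU
result = c.asarray()    # copy the result back to main memory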

16 fast ml-2013-07-09-Introducing phraug

Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug*, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. With phraug you currently can convert from one format to another:

csv to libsvm
csv to Vowpal Wabbit
libsvm to csv
libsvm to Vowpal Wabbit
tsv to csv

And perform some other file operations:

count lines in a file
sample lines from a file
randomly split a file into two
split a file into a number of similarly sized chunks
save a continuous subset of lines from a file (for example, the first 100)
delete specified columns from a csv file
normalize (shift and scale) columns in a csv file

Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily…
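To give the flavour, here’s what one of these scripts boils down to - a minimal sketch in the spirit of phraug, not its actual code: randomly split a file into two, line by line, without ever loading it whole:

import random
import sys

# usage: python split.py input.csv out_a.csv out_b.csv 0.9
input_file, out_a, out_b = sys.argv[ 1 ], sys.argv[ 2 ], sys.argv[ 3 ]
p = float( sys.argv[ 4 ] )

with open( input_file ) as i_f, open( out_a, 'w' ) as a, open( out_b, 'w' ) as b:
    for line in i_f:
        # send each line to the first file with probability p
        if random.random() < p:
            a.write( line )
        else:
            b.write( line )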

17 fast ml-2013-07-05-Processing large files, line by line

Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, the libsvm format is popular. Both can be processed using the csv module in Python:

import csv

i_f = open( input_file, 'r' )
reader = csv.reader( i_f )

For libsvm you just set the delimiter to space:

reader = csv.reader( i_f, delimiter = ' ' )

Then you go over the file contents. Each line is a list of strings (this assumes a csv.writer has been opened for the output file):

for line in reader:
    # do something with the line, for example:
    label = float( line[0] )
    # ....
    writer.writerow( line )

If you need to do a second pass, you just rewind the input file:

i_f.seek( 0 )
for line in reader:
    ...

18 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

Introduction: Vowpal Wabbit now supports a few modes of non-linear supervised learning:

a neural network with a single hidden layer
automatic creation of polynomial (specifically quadratic and cubic) features
n-grams

We describe how to use them, providing examples from the Kaggle Amazon competition and for the kin8nm dataset. Neural network. The original motivation for creating the neural network code in VW was to win some Kaggle competitions using only vee-dub, and that goal becomes much more feasible once you have a strong non-linear learner. The network seems to be a classic multilayer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout. Here’s an example of how to create a network with 10 hidden units:

vw -d data.vw --nn 10

Quadratic and cubic features. The idea of quadratic features is to create all possible combinations of the original features, so that…
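What quadratic feature creation does is easy to emulate in Python, as a sketch of the idea (VW itself works on namespaces and hashed features, so this is conceptual only):

import numpy as np
from itertools import combinations_with_replacement

x = np.array( [ 1.0, 2.0, 3.0 ] )    # one example's features

# quadratic features: all pairwise products x_i * x_j for i <= j
quadratic = [ x[ i ] * x[ j ]
              for i, j in combinations_with_replacement( range( len( x ) ), 2 ) ]
print( quadratic )    # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]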

19 fast ml-2013-06-01-Amazon aspires to automate access control

Introduction: This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily, and with an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit. Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs:

1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,17183,1540,117961,118343,123125,118536,118536,308574,118539
1,36724,14457,118219,118220,117884,117879,267952…
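Each column being a categorical ID, the usual trick for a linear model is one-hot encoding; VW does this implicitly via feature hashing. A sketch with scikit-learn’s DictVectorizer, with illustrative column names:

from sklearn.feature_extraction import DictVectorizer

rows = [
    { 'resource': '39353', 'manager': '85475', 'role': '117961' },
    { 'resource': '17183', 'manager': '1540',  'role': '117961' },
]

v = DictVectorizer()
X = v.fit_transform( rows )    # sparse 0/1 matrix, one column per (field, ID) pair
print( v.get_feature_names() )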

20 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

Introduction: The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot, used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525. We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava. To their credit, Charlie has won the two remaining Challenges in Representation Learning. Not-so-deep learning. The difference from our previous, beating-the-benchmark attempt is twofold: one layer instead of two, and VW instead of a random forest for supervised learning. Somewhat surprisingly, one layer works better than two. Even more surprisingly, with enough units you can get 0.634 using a linear model (Vowpal Wabbit, of course, One-Against-All). In our understanding, that’s the point of overcomplete representations*, which Stanford people seem to care much about. Recall The secret of the big guys and the paper…
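For reference, the sparse filtering objective fits in a few lines of numpy. A sketch after Ngiam et al.: soft-absolute activations, normalized per feature and then per example, with an L1 penalty; minimizing this over W (for instance with L-BFGS) yields the filters:

import numpy as np

def sparse_filtering_objective( W, X, eps = 1e-8 ):
    # X: examples x inputs, W: inputs x features
    F = np.sqrt( np.dot( X, W ) ** 2 + eps )                   # soft absolute value
    F = F / np.sqrt( ( F ** 2 ).sum( 0 ) + eps )               # normalize each feature
    F = F / np.sqrt( ( F ** 2 ).sum( 1 ) + eps )[ :, None ]    # normalize each example
    return np.abs( F ).sum()                                   # L1 sparsity to minimize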

21 fast ml-2013-05-12-And deliver us from Weka

22 fast ml-2013-05-01-Deep learning made easy

23 fast ml-2013-04-17-Regression as classification

24 fast ml-2013-04-10-Gender discrimination

25 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

26 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

27 fast ml-2013-03-07-Choosing a machine learning algorithm

28 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

29 fast ml-2013-02-18-Predicting advertised salaries

30 fast ml-2013-02-07-The secret of the big guys

31 fast ml-2013-01-17-A very fast denoising autoencoder

32 fast ml-2013-01-14-Feature selection in practice

33 fast ml-2013-01-12-Intro to random forests

34 fast ml-2013-01-07-Machine learning courses online

35 fast ml-2013-01-04-Madelon: Spearmint's revenge