fast_ml 2014 knowledge graph (by maker-knowledge-mining)


blog posts:

1 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit. We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York: All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning methods, including denoising auto-encoders and contractive auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v

2 fast ml-2014-05-08-Impute missing values with Amelia

Introduction: One of the ways to deal with missing values in data is to impute them. We use the Amelia R package on The Analytics Edge competition data. Since one typically gets many imputed sets, we bag them with good results. So good that it seems we would have won the contest if not for a bug in our code. The competition: Much to our surprise, we ranked 17th out of almost 1700 competitors - from the public leaderboard score we expected to be in the top 10%, barely. The contest turned out to be one with huge overfitting possibilities, and people overfitted badly - some preliminary leaders ended up down the middle of the pack, while we soared up the ranks. But wait! There’s more. When preparing this article, we discovered a bug - apparently we used only 1980 points for training:
points_in_test = 1980
train = data.iloc[:points_in_test,] # should be [:-points_in_test,]
test = data.iloc[-points_in_test:,]
If not for this little bug, we would have won, apparently. Imputing data: A
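
To make the slicing issue concrete, here is a minimal sketch of the corrected split, assuming the data is a Pandas DataFrame whose last points_in_test rows form the test set (the file name is a placeholder, not the actual competition file):

```python
import pandas as pd

# Placeholder path; the real data comes from The Analytics Edge competition.
data = pd.read_csv('data.csv')

points_in_test = 1980

# Buggy version: takes the FIRST 1980 rows for training
# and silently throws away the rest of the training data.
# train = data.iloc[:points_in_test]

# Corrected version: everything except the last 1980 rows is for training.
train = data.iloc[:-points_in_test]
test = data.iloc[-points_in_test:]
```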

3 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

Introduction: Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros and a single 1. This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers. UPDATE: Turns out that Pandas has a get_dummies() function which does what we’re after. More on this in a while. We’ll use Pandas to load the data, do some cleaning and send it to Scikit-
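
A minimal sketch of both routes, on a made-up two-column frame (the column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy data; column names and values are made up.
df = pd.DataFrame({
    'make': ['BMW', 'Mercedes', 'BMW'],
    'mileage': [30000, 12000, 52000],
})

# Route 1: DictVectorizer one-hot encodes string values
# and passes numeric values through unchanged.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(df.to_dict(orient='records'))
print(vec.feature_names_)   # ['make=BMW', 'make=Mercedes', 'mileage']
print(X)

# Route 2: Pandas' own one-hot encoding.
print(pd.get_dummies(df, columns=['make']))
```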

4 fast ml-2014-04-21-Predicting happiness from demographics and poll answers

Introduction: This time we attempt to predict whether poll respondents said they’re happy or not. We also take a look at which features are the most important for that prediction. There is a private competition at Kaggle for students of The Analytics Edge MOOC. You can get the invitation link by signing up for the course and going to the week seven front page. It’s an entry-level, educational contest - there are no prizes and the data is small. The competition is based on data from the Show of Hands, a mobile polling application for US residents. The link between the app and the MOOC is MIT: it’s an MIT class and an MIT alumnus’s app. You get a few thousand examples. Each consists of some demographics - year of birth, gender, income, household status, education level, party preference - plus a number of answers to yes/no poll questions from the Show of Hands. Here’s a sample: Are you good at math? Have you cried in the past 60 days? Do you brush your teeth two or more times ever
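
The article's exact approach isn't reproduced here, but as a rough sketch of how one might rank features by importance on data like this (the file name, target column and the choice of a random forest are all assumptions for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder file and column names, not the actual competition data.
train = pd.read_csv('train.csv')
y = train.pop('Happy')

# One-hot encode the categorical demographics and poll answers,
# and crudely fill missing values so the classifier can handle them.
X = pd.get_dummies(train).fillna(-1)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Rank features by importance.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```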

5 fast ml-2014-04-12-Deep learning these days

Introduction: It seems that quite a few people with an interest in deep learning think of it in terms of unsupervised pre-training, autoencoders, stacked RBMs and deep belief networks. It’s easy to get into this groove by watching one of Geoff Hinton’s videos from a few years ago, where he bashes backpropagation in favour of unsupervised methods that are able to discover the structure in data by themselves, the same way the human brain does. Those videos, papers and tutorials linger. They were state of the art once, but things have changed since then. These days supervised learning is king again. This has to do with the fact that you can look at data from many different angles, and usually you’d prefer a representation that is useful for the discriminative task at hand. Unsupervised learning will find some angle, but will it be the one you want? In the case of the MNIST digits, sure. Otherwise probably not. Or maybe it will find a lot of angles while you only need one. Ladies and gentlemen, pleas

6 fast ml-2014-04-01-Exclusive Geoff Hinton interview

Introduction: Geoff Hinton is a living legend. He almost single-handedly invented backpropagation for training feed-forward neural networks. Despite in theory being universal function approximators, these networks turned out to be pretty much useless for more complex problems, like computer vision and speech recognition. Professor Hinton responded by creating deep networks and deep learning, an ultimate form of machine learning. Recently we’ve been fortunate to ask Geoff a few questions and have him answer them. Geoff, thanks so much for talking to us. You’ve had a long and fruitful career. What drives you these days? Well, after a man hits a certain age, his priorities change. Back in the 80s I was happy when I was able to train a network with eight hidden units. Now I can finally have thousands and possibly millions of them. So I guess the answer is scale. Apart from that, I like people at Google and I like making them a ton of money. They happen to pay me well, so it’s a win-win situ

7 fast ml-2014-03-31-If you use R, you may want RStudio

Introduction: RStudio is an IDE for R. It gives the language a bit of a slickness factor it so badly needs. The nice thing about the software, besides good looks, is that it integrates console, help pages, plots and editor (if you want it) in one place. For example, instead of switching to a web browser for help, you have it right next to the console. The same thing with plots: in place of the usual overlapping windows, all plots go to one pane where you can navigate back and forth between them with arrows. You can save a plot as an image or as a PDF. While saving as an image you are presented with a preview and you get to choose the image format and size. That’s the kind of detail that shows how RStudio makes working with R better. You can have up to four panes on the screen: console; source / data; plots / help / files; history and workspace. Arrange them in any way you like. The source pane has a run button, so you can execute the current line of code in the console, or selec

8 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

Introduction: How to represent features for machine learning is an important business. For example, deep learning is all about finding good representations. What exactly they are depends on the task at hand. We investigate how to use available labels to obtain good representations. Motivation: The paper that inspired us a while ago was Nonparametric Guidance of Autoencoder Representations using Label Information by Snoek, Adams and Larochelle. It’s about autoencoders, but contains a greater idea: Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. (…) However, pure unsupervised learning (…) can find representations that may or may not be useful for the ultimate discriminative task. (…) In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. That is, we wish to find ways to extract statistical structu
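
The paper's nonparametrically guided autoencoder is not reproduced here; as a much simpler, swapped-in illustration of the general idea - letting labels shape a low-dimensional representation - compare unsupervised PCA with supervised linear discriminant analysis in scikit-learn:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

# Unsupervised: PCA picks directions of maximum variance, ignoring the labels.
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA uses the labels to pick directions that separate the classes,
# so the resulting 2D representation is tailored to the discriminative task.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```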

9 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

Introduction: We have already written a few articles about Pylearn2. Today we’ll look at PyBrain. It is another Python neural networks library, and this is where the similarities end. They’re like day and night: Pylearn2 - Byzantinely complicated, PyBrain - simple. We attempted to train a regression model and succeeded on the first try (more on this below). Try this with Pylearn2. While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. The library features the classic perceptron as well as recurrent neural networks and other things, some of which, for example Evolino, would be hard to find elsewhere. On the downside, PyBrain feels unfinished, abandoned. It is no longer actively developed and the documentation is skimpy. There are no modern gimmicks like dropout and rectified linear units - just good ol’ sigmoid and ta
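
A minimal sketch of the kind of regression setup PyBrain makes easy, using its shortcut API on a toy sine-fitting problem (the architecture and hyperparameters here are arbitrary, not the article's actual model):

```python
import numpy as np
from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

# Toy regression target: y = sin(x), just to exercise the API.
ds = SupervisedDataSet(1, 1)              # 1 input, 1 target
for x in np.linspace(-3, 3, 200):
    ds.addSample((x,), (np.sin(x),))

# One hidden layer of 10 units; buildNetwork uses sigmoid hidden units by default.
net = buildNetwork(1, 10, 1, bias=True)

trainer = BackpropTrainer(net, ds, learningrate=0.01, momentum=0.9)
for epoch in range(100):
    trainer.train()                       # one pass over the dataset

print(net.activate((1.0,)))               # prediction for x = 1.0
```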

10 fast ml-2014-02-20-Are stocks predictable?

Introduction: We’d like to be able to predict the stock market. That seems like a nice way of making money. We’ll address the fundamental issue: can stocks be predicted in the short term, that is, a few days ahead? There’s a technique that seeks to answer the question of predictability. It’s called Forecastable Component Analysis. Based on a new forecastability measure, ForeCA finds an optimal transformation to separate a multivariate time series into a forecastable and an orthogonal white noise space. The author, Georg M. Goerg, implemented it in the R package ForeCA. It might be useful in two ways. First, it can tell you how forecastable a time series is. Second, given a multivariate time series, let’s say a portfolio of stocks, it can find forecastable components. The idea in the second point is similar to PCA - ForeCA is a linear dimensionality reduction technique. The main difference is that the method explicitly addresses forecastability. It does so by considering an interplay between time and

11 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

Introduction: Out of 215 contestants, we placed 8th in the Cats and Dogs competition at Kaggle. The top-ten finish gave us the master badge. The competition was about discerning the animals in images and here’s how we did it. We extracted the features using pre-trained deep convolutional networks, specifically decaf and OverFeat. Then we trained some classifiers on these features. The whole thing was inspired by Kyle Kastner’s decaf + pylearn2 combo and we expanded this idea. The classifiers were linear models from scikit-learn and a neural network from Pylearn2. In the end we created a voting ensemble of the individual models. OverFeat features: We touched on OverFeat in Classifying images with a pre-trained deep network. A better way to use it in this competition’s context is to extract the features from the layer before the classifier, as Pierre Sermanet suggested in the comments. Concretely, in the larger OverFeat model (-l) layer 24 is the softmax, at least in the
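
A rough sketch of the “classifiers on pre-extracted features” stage, assuming the decaf/OverFeat features have already been saved to disk as NumPy arrays (the file names and the particular scikit-learn models are placeholders, not the exact ones from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Assumed files: one row of extracted features per image, labels 0 = cat, 1 = dog.
X = np.load('overfeat_features.npy')
y = np.load('labels.npy')

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'linear SVM': LinearSVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(name, scores.mean())
```

A voting ensemble can then be built by training the individual models and taking, for each test image, the majority prediction (or the averaged probabilities for models that provide them).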

12 fast ml-2014-01-25-Why IPy: reasons for using IPython interactively

Introduction: IPython is known for the notebooks. But the first thing they list on their homepage is a “powerful interactive shell”. And that’s true - if you use Python interactively, you’ll dig IPython. Here are a few features that make a difference for us. Command line history and autocomplete: these are basic shell features you come to expect and rely upon after a first contact with a Unix shell; the standard Python interpreter feels somewhat impoverished without them. %paste: normally when you paste indented code, it won’t work; you can fix it with this magic command. Other %magic commands: actually, you can skip the % - paste, run some_script.py, pwd, cd some_dir. Shell commands with !: for example !ls -las; alas, aliases from your system shell don’t work. Looks good: red and green add a nice touch, especially in a window with a dark background; if you have a light background, use %colors LightBG. Share your favourite features

13 fast ml-2014-01-20-How to get predictions from Pylearn2

Introduction: A while ago we showed how to get predictions from a Pylearn2 model. It is a little tricky, partly because of splitting data into batches. If you’re able to fit your data in memory, you can strip the batch handling code and it becomes easier to see what’s going on. We exercise the concept to distinguish cats from dogs again, with superior results. Step by step: You have a pickled model from Pylearn2. Let’s load it:
from pylearn2.utils import serial
model_path = 'model.pkl'
model = serial.load( model_path )
Next, some Theano weirdness. Theano is a compiler for symbolic expressions, and these expressions are what we deal with when predicting. We need to define expressions for X and Y:
X = model.get_input_space().make_theano_batch()
Y = model.fprop( X )
Mind you, these are not variables, but rather descriptions of how to get variables. Y is easy to understand: just feed the data to the model and forward-propagate. X is more of an idiom, the incantations above make sur
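
Putting the pieces together, a minimal sketch of the in-memory prediction path, assuming the whole test set fits in memory as a float32 design matrix (the file names are placeholders):

```python
import numpy as np
import theano
from pylearn2.utils import serial

# Load the pickled Pylearn2 model.
model = serial.load('model.pkl')

# Symbolic input and forward pass, as described above.
X = model.get_input_space().make_theano_batch()
Y = model.fprop(X)

# Compile the symbolic graph into a callable Theano function.
f = theano.function([X], Y)

# Placeholder file; one row per test example.
x_test = np.load('x_test.npy').astype(np.float32)
y_pred = f(x_test)
print(y_pred.shape)
```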

14 fast ml-2014-01-10-Classifying images with a pre-trained deep network

Introduction: Recently at least two research teams made their pre-trained deep convolutional networks available, so you can classify your images right away. We’ll see how to go about it, with data from the Cats & Dogs competition at Kaggle as an example. We’ll be using OverFeat, a classifier and feature extractor from the New York guys led by Yann LeCun and Rob Fergus. The principal author, Pierre Sermanet, is currently first on the Dogs vs. Cats leaderboard. The other available implementation we know of comes from Berkeley. It’s called Caffe and is a successor to decaf. Yangqing Jia, the main author of both, is also near the top of the leaderboard. Both networks were trained on ImageNet, which is an image database organized according to the WordNet hierarchy. It was the ImageNet Large Scale Visual Recognition Challenge 2012 in which Alex Krizhevsky crushed the competition with his network. His error was 16%, the second best - 26%. Data: The Kaggle competition featur