blog-mining fast_ml knowledge-graph by maker-knowledge-mining
the latest blogs:
1 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit . We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning, including denoising auto-encoders, contracting auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v
2 fast ml-2014-05-08-Impute missing values with Amelia
Introduction: One of the ways to deal with missing values in data is to impute them. We use Amelia R package on The Analytics Edge competition data. Since one typically gets many imputed sets, we bag them with good results. So good that it seems we would have won the contest if not for a bug in our code. The competition Much to our surprise, we ranked 17th out of almost 1700 competitors - from the public leaderboard score we expected to be in top 10%, barely. The contest turned out to be one with huge overfitting possibilities, and people overfitted badly - some preliminary leaders ended up down the middle of the pack, while we soared up the ranks. But wait! There’s more. When preparing this article, we discovered a bug - apparently we used only 1980 points for training: points_in_test = 1980 train = data.iloc[:points_in_test,] # should be [:-points_in_test,] test = data.iloc[-points_in_test:,] If not for this little bug, we would have won, apparently. Imputing data A
3 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
Introduction: Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such tool but your data includes categorical features. To represent them as numbers typically one converts each categorical feature using “one-hot encoding”, that is from a value like “BMW” or “Mercedes” to a vector of zeros and one 1 . This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, Google + StackOverflow combo works well as a source of answers. UPDATE: Turns out that Pandas has get_dummies() function which does what we’re after. More on this in a while. We’ll use Pandas to load the data, do some cleaning and send it to Scikit-
4 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
Introduction: This time we attempt to predict if poll responders said they’re happy or not. We also take a look at what features are the most important for that prediction. There is a private competition at Kaggle for students of The Analytics Edge MOOC. You can get the invitation link by signing up for the course and going to the week seven front page. It’s an entry-level, educational contest - there’s no prizes and the data is small. The competition is based on data from the Show of Hands , a mobile polling application for US residents. The link between the app and the MOOC is MIT: it’s an MIT’s class and MIT’s alumnus’ app. You get a few thousand examples. Each consists of some demographics: year of birth gender income household status education level party preference Plus a number of answers to yes/no poll questions from the Show of Hands. Here’s a sample: Are you good at math? Have you cried in the past 60 days? Do you brush your teeth two or more times ever
5 fast ml-2014-04-12-Deep learning these days
Introduction: It seems that quite a few people with interest in deep learning think of it in terms of unsupervised pre-training, autoencoders, stacked RBMs and deep belief networks. It’s easy to get into this groove by watching one of Geoff Hinton’s videos from a few years ago, where he bashes backpropagation in favour of unsupervised methods that are able to discover the structure in data by themselves, the same way as human brain does. Those videos, papers and tutorials linger. They were state of the art once, but things have changed since then. These days supervised learning is the king again. This has to do with the fact that you can look at data from many different angles and usually you’d prefer representation that is useful for the discriminative task at hand . Unsupervised learning will find some angle, but will it be the one you want? In case of the MNIST digits, sure. Otherwise probably not. Or maybe it will find a lot of angles while you only need one. Ladies and gentlemen, pleas
6 fast ml-2014-04-01-Exclusive Geoff Hinton interview
Introduction: Geoff Hinton is a living legend. He almost single-handedly invented backpropagation for training feed-forward neural networks. Despite in theory being universal function approximators, these networks turned out to be pretty much useless for more complex problems, like computer vision and speech recognition. Professor Hinton responded by creating deep networks and deep learning, an ultimate form of machine learning. Recently we’ve been fortunate to ask Geoff a few questions and have him answer them. Geoff, thanks so much for talking to us. You’ve had a long and fruitful career. What drives you these days? Well, after a man hits a certain age, his priorities change. Back in the 80s I was happy when I was able to train a network with eight hidden units. Now I can finally have thousands and possibly millions of them. So I guess the answer is scale. Apart from that, I like people at Google and I like making them a ton of money. They happen to pay me well, so it’s a win-win situ
7 fast ml-2014-03-31-If you use R, you may want RStudio
Introduction: RStudio is an IDE for R. It gives the language a bit of a slickness factor it so badly needs. The nice thing about the software, beside good looks, is that it integrates console, help pages, plots and editor (if you want it) in one place. For example, instead of switching for help to a web browser, you have it right next to the console. The same thing with plots: in place of the usual overlapping windows, all plots go to one pane where you can navigate back and forth between them with arrows. You can save a plot as an image or as a PDF. While saving as image you are presented with a preview and you get to choose the image format and size. That’s the kind of detail that shows how RStudio makes working with R better. You can have up to four panes on the screen: console source / data plots / help / files history and workspace Arrange them in any way you like. The source pane has a run button, so you can execute the current line of code in the console, or selec
Introduction: How to represent features for machine learning is an important business. For example, deep learning is all about finding good representations. What exactly they are depends on a task at hand. We investigate how to use available labels to obtain good representations. Motivation The paper that inspired us a while ago was Nonparametric Guidance of Autoencoder Representations using Label Information by Snoek, Adams and LaRochelle. It’s about autoencoders, but contains a greater idea: Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. (…) However, pure unsupervised learning (…) can find representations that may or may not be useful for the ultimate discriminative task. (…) In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. That is, we wish to find ways to extract statistical structu
9 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
Introduction: We have already written a few articles about Pylearn2 . Today we’ll look at PyBrain. It is another Python neural networks library, and this is where similiarites end. They’re like day and night: Pylearn2 - Byzantinely complicated, PyBrain - simple. We attempted to train a regression model and succeeded at first take (more on this below). Try this with Pylearn2. While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. The library features classic perceptron as well as recurrent neural networks and other things, some of which, for example Evolino , would be hard to find elsewhere. On the downside, PyBrain feels unfinished, abandoned. It is no longer actively developed and the documentation is skimpy. There’s no modern gimmicks like dropout and rectified linear units - just good ol’ sigmoid and ta
10 fast ml-2014-02-20-Are stocks predictable?
Introduction: We’d like to be able to predict stock market. That seems like a nice way of making money. We’ll address the fundamental issue: can stocks be predicted in the short term, that is a few days ahead? There’s a technique that seeks to answer the question of predictability. It’s called Forecastable Component Analysis . Based on a new forecastability measure, ForeCA finds an optimal transformation to separate a multivariate time series into a forecastable and an orthogonal white noise space. The author, Georg M. Goerg*, implemented it in R package ForeCA . It might be useful in two ways: It can tell you how forecastable time series is. Given a multivariate time series, let’s say a portfolio of stocks, it can find forecastable components. The idea in the second point is similiar to PCA - ForeCA is a linear dimensionality reduction technique. The main difference is that the method explicitly addresses forecastability. It does so by considering an interplay between time and