fast_ml fast_ml-2013 fast_ml-2013-27 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple of benchmarks for this competition and the best one is unusually hard to beat 1 - fewer than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github, and has one or two very appealing properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2 . Geoff Hinton from Toronto talks about two ends
sentIndex sentText sentNum sentScore
1 It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. [sent-2, score-0.245]
2 The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. [sent-6, score-0.687]
3 The main idea of deep unsupervised learning, as we understand it, is feature extraction. [sent-10, score-0.255]
4 The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2 . [sent-12, score-0.271]
5 Geoff Hinton from Toronto talks about two ends of the spectrum in machine learning: one is statistics and getting rid of noise, the other one - AI, or the things that humans are good at but computers are not. [sent-13, score-0.379]
6 Each layer is supposed to extract higher-level features, and these features are supposed to be more useful for the task at hand. [sent-16, score-0.308]
7 That’s the kind of thing sparse filtering learns from image patches. [sent-17, score-0.423]
8 So one layer might learn to recognize simple shapes from pixels. [sent-19, score-0.499]
9 Another layer could learn to combine these shapes into more sophisticated features. [sent-20, score-0.275]
10 Sparse filtering attempts to overcome these difficulties. [sent-25, score-0.237]
11 The hyperparams to choose are the number of layers and the number of units in each layer. Then you run the optimizer, and it finds the weights (a minimal sketch follows right after this sentence list). [sent-29, score-0.244]
12 We trained a two-layer sparse filtering structure. [sent-34, score-0.539]
13 m to reduce the number of iterations (to make it run faster) and the number of so-called corrections (if you run out of memory): By default, minFunc uses a large number of corrections in the L-BFGS method. [sent-50, score-0.377]
14 If you’d like to use extra data, you might want to convert it to . [sent-63, score-0.196]
15 Toronto vs Montreal To us, there seem to be three deep learning centers in academia: Stanford, Toronto, and Montreal. [sent-68, score-0.245]
16 The Montreal group is led by Yoshua Bengio, and they organize the said workshop at ICML 2013. [sent-73, score-0.261]
17 In response, Montreal released its invention: maxout, a natural companion to dropout designed to both facilitate optimization by dropout and improve the accuracy of dropout’s fast approximate model averaging technique. [sent-76, score-0.34]
18 Toronto uses Google Protocol Buffers as a configuration language in deepnet, while Montreal uses YAML in their pylearn2 library. [sent-77, score-0.206]
19 Maybe that healthy competition helps deep learning advance so successfully. [sent-80, score-0.333]
20 It’s worth noting that computers became better at recognizing hand-written digits than humans (we’re talking about the MNIST dataset here). [sent-83, score-0.303]
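The sparse filtering recipe in sentences 11-13 above (pick the number of layers and units, hand the objective to L-BFGS, cap its iterations and corrections) can be illustrated outside MATLAB. The post itself uses Jiquan Ngiam’s code with minFunc; the snippet below is only a minimal NumPy/SciPy sketch of the same idea, with function names and toy data of our own choosing, and with finite-difference gradients that are tolerable only at toy sizes. SciPy’s maxiter and maxcor options play roughly the role of minFunc’s iteration and correction settings.

# Minimal sparse filtering sketch (NumPy + SciPy). Illustration only; not the post's MATLAB code.
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_objective(w, X, n_units, eps=1e-8):
    # X: (n_inputs, n_examples); w: flattened (n_units, n_inputs) weight matrix.
    W = w.reshape(n_units, X.shape[0])
    F = np.sqrt((W @ X) ** 2 + eps)                              # soft absolute value of linear features
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)   # normalize each feature (row) across examples
    F = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + eps)   # normalize each example (column)
    return F.sum()                                               # L1 sparsity penalty on the normalized features

def fit_layer(X, n_units, maxiter=100, maxcor=10, seed=0):
    # One layer, optimized with L-BFGS; maxiter / maxcor limit iterations and corrections.
    rng = np.random.default_rng(seed)
    w0 = 0.01 * rng.standard_normal(n_units * X.shape[0])
    res = minimize(sparse_filtering_objective, w0, args=(X, n_units),
                   method='L-BFGS-B', options={'maxiter': maxiter, 'maxcor': maxcor})
    return res.x.reshape(n_units, X.shape[0])

def transform(W, X, eps=1e-8):
    # Apply a trained layer: same soft-abs and double normalization as in the objective.
    F = np.sqrt((W @ X) ** 2 + eps)
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)
    return F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + eps)

# Greedy two-layer training, as in the post: layer two sees layer one's features.
X = np.random.default_rng(1).standard_normal((20, 200))          # toy data: 20 inputs, 200 examples
W1 = fit_layer(X, n_units=32)
W2 = fit_layer(transform(W1, X), n_units=16)
features = transform(W2, transform(W1, X))                        # representation for a supervised model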
wordName wordTfidf (topN-words)
[('toronto', 0.255), ('filtering', 0.237), ('montreal', 0.214), ('minfunc', 0.193), ('layer', 0.18), ('dropout', 0.17), ('deep', 0.169), ('humans', 0.161), ('extra', 0.142), ('computers', 0.142), ('corr', 0.129), ('icml', 0.129), ('multimedia', 0.129), ('optw', 0.129), ('rcond', 0.129), ('sparse', 0.122), ('geoff', 0.118), ('corrections', 0.107), ('recognize', 0.107), ('warning', 0.107), ('workshop', 0.107), ('uses', 0.103), ('shapes', 0.095), ('competition', 0.088), ('andrew', 0.086), ('stanford', 0.086), ('group', 0.086), ('hinton', 0.086), ('unsupervised', 0.086), ('learning', 0.076), ('among', 0.073), ('ai', 0.073), ('success', 0.073), ('github', 0.073), ('google', 0.07), ('black', 0.068), ('lead', 0.068), ('lower', 0.064), ('kind', 0.064), ('layers', 0.064), ('supposed', 0.064), ('simple', 0.063), ('memory', 0.06), ('reduce', 0.06), ('might', 0.054), ('common', 0.054), ('step', 0.054), ('edit', 0.054), ('buffers', 0.054), ('protocol', 0.054)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 27 fast ml-2013-05-01-Deep learning made easy
2 0.2813096 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
Introduction: The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot, used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525. We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava. To his credit, Charlie has won the two remaining Challenges in Representation Learning. Not-so-deep learning The difference to our previous, beating-the-benchmark attempt is twofold: one layer instead of two, and, for supervised learning, VW instead of a random forest. Somewhat surprisingly, one layer works better than two. Even more surprisingly, with enough units you can get 0.634 using a linear model (Vowpal Wabbit, of course, One-Against-All). In our understanding, that’s the point of overcomplete representations*, which Stanford people seem to care much about. Recall The secret of the big guys and the pape
3 0.19788156 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit. We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning, including denoising auto-encoders, contracting auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v
4 0.17775629 58 fast ml-2014-04-12-Deep learning these days
Introduction: It seems that quite a few people with interest in deep learning think of it in terms of unsupervised pre-training, autoencoders, stacked RBMs and deep belief networks. It’s easy to get into this groove by watching one of Geoff Hinton’s videos from a few years ago, where he bashes backpropagation in favour of unsupervised methods that are able to discover the structure in data by themselves, the same way as the human brain does. Those videos, papers and tutorials linger. They were state of the art once, but things have changed since then. These days supervised learning is the king again. This has to do with the fact that you can look at data from many different angles and usually you’d prefer a representation that is useful for the discriminative task at hand. Unsupervised learning will find some angle, but will it be the one you want? In case of the MNIST digits, sure. Otherwise probably not. Or maybe it will find a lot of angles while you only need one. Ladies and gentlemen, pleas
5 0.1431199 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
Introduction: Geoff Hinton is a living legend. He almost single-handedly invented backpropagation for training feed-forward neural networks. Despite in theory being universal function approximators, these networks turned out to be pretty much useless for more complex problems, like computer vision and speech recognition. Professor Hinton responded by creating deep networks and deep learning, an ultimate form of machine learning. Recently we’ve been fortunate to ask Geoff a few questions and have him answer them. Geoff, thanks so much for talking to us. You’ve had a long and fruitful career. What drives you these days? Well, after a man hits a certain age, his priorities change. Back in the 80s I was happy when I was able to train a network with eight hidden units. Now I can finally have thousands and possibly millions of them. So I guess the answer is scale. Apart from that, I like people at Google and I like making them a ton of money. They happen to pay me well, so it’s a win-win situ
6 0.13884594 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
7 0.13839172 43 fast ml-2013-11-02-Maxing out the digits
8 0.13447101 18 fast ml-2013-01-17-A very fast denoising autoencoder
9 0.12533799 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
10 0.12270396 40 fast ml-2013-10-06-Pylearn2 in practice
11 0.12154675 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
12 0.10925149 19 fast ml-2013-02-07-The secret of the big guys
13 0.10481053 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
14 0.096142933 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
15 0.09162081 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
16 0.090008825 50 fast ml-2014-01-20-How to get predictions from Pylearn2
17 0.089695781 15 fast ml-2013-01-07-Machine learning courses online
18 0.089077637 2 fast ml-2012-08-27-Kaggle job recommendation challenge
19 0.074684799 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
20 0.073250815 22 fast ml-2013-03-07-Choosing a machine learning algorithm
topicId topicWeight
[(0, 0.379), (1, 0.245), (2, 0.245), (3, 0.1), (4, 0.091), (5, -0.151), (6, 0.194), (7, 0.116), (8, -0.119), (9, -0.163), (10, 0.152), (11, -0.09), (12, 0.041), (13, -0.037), (14, 0.036), (15, 0.019), (16, -0.125), (17, 0.033), (18, 0.184), (19, 0.026), (20, -0.061), (21, 0.105), (22, 0.065), (23, 0.015), (24, 0.018), (25, 0.034), (26, -0.003), (27, 0.114), (28, 0.109), (29, 0.033), (30, 0.047), (31, 0.161), (32, 0.081), (33, 0.003), (34, 0.032), (35, 0.087), (36, -0.047), (37, 0.037), (38, -0.055), (39, -0.016), (40, -0.099), (41, -0.04), (42, -0.076), (43, -0.02), (44, 0.096), (45, 0.018), (46, -0.27), (47, 0.039), (48, 0.173), (49, 0.002)]
simIndex simValue blogId blogTitle
same-blog 1 0.97169018 27 fast ml-2013-05-01-Deep learning made easy
2 0.76461053 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
3 0.47189143 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
4 0.40955505 58 fast ml-2014-04-12-Deep learning these days
5 0.33359826 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition
Introduction: Out of 215 contestants, we placed 8th in the Cats and Dogs competition at Kaggle. The top ten finish gave us the master badge. The competition was about discerning the animals in images and here’s how we did it. We extracted the features using pre-trained deep convolutional networks, specifically decaf and OverFeat. Then we trained some classifiers on these features. The whole thing was inspired by Kyle Kastner’s decaf + pylearn2 combo and we expanded this idea. The classifiers were linear models from scikit-learn and a neural network from Pylearn2. At the end we created a voting ensemble of the individual models (a sketch of this stage follows after this list). OverFeat features We touched on OverFeat in Classifying images with a pre-trained deep network. A better way to use it in this competition’s context is to extract the features from the layer before the classifier, as Pierre Sermanet suggested in the comments. Concretely, in the larger OverFeat model (-l) layer 24 is the softmax, at least in the
6 0.30066052 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
7 0.29680875 40 fast ml-2013-10-06-Pylearn2 in practice
8 0.29116151 18 fast ml-2013-01-17-A very fast denoising autoencoder
9 0.24326527 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
10 0.23319915 43 fast ml-2013-11-02-Maxing out the digits
11 0.22554331 19 fast ml-2013-02-07-The secret of the big guys
12 0.22149508 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
13 0.21598433 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
14 0.21326824 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
15 0.19906618 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
16 0.19843984 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
17 0.19673644 15 fast ml-2013-01-07-Machine learning courses online
18 0.18875548 50 fast ml-2014-01-20-How to get predictions from Pylearn2
19 0.18872137 25 fast ml-2013-04-10-Gender discrimination
20 0.18013684 13 fast ml-2012-12-27-Spearmint with a random forest
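The cats-and-dogs entry above (entry 5) describes training linear classifiers on features extracted with pre-trained convnets and then combining them by voting. The sketch below covers only that second stage, in scikit-learn, on random stand-in data; the actual feature extraction with decaf or OverFeat is not reproduced here, and the estimator choices are ours, not necessarily the post’s.

# Linear models on (stand-in) deep-net features, combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in for the (n_images, n_deep_features) matrix dumped by decaf / OverFeat.
X, y = make_classification(n_samples=500, n_features=512, n_informative=50, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ('logreg', LogisticRegression(max_iter=1000)),
        ('svm', LinearSVC()),
    ],
    voting='hard',                       # majority vote over the individual linear models
)
print(cross_val_score(ensemble, X, y, cv=3).mean())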
topicId topicWeight
[(24, 0.012), (26, 0.024), (31, 0.088), (35, 0.023), (48, 0.014), (51, 0.02), (55, 0.013), (69, 0.615), (71, 0.034), (73, 0.012), (78, 0.037), (81, 0.01), (99, 0.034)]
simIndex simValue blogId blogTitle
same-blog 1 0.99843305 27 fast ml-2013-05-01-Deep learning made easy
2 0.98860502 14 fast ml-2013-01-04-Madelon: Spearmint's revenge
Introduction: Little Spearmint couldn’t sleep that night. I was so close… - he was thinking. It seemed that he had found a better-than-default value for one of the random forest hyperparams, but it turned out to be false. He made a decision as he fell asleep: Next time, I will show them! The way to do this is to use a dataset that is known to produce lower error with high mtry values, namely the previously mentioned Madelon from the NIPS 2003 Feature Selection Challenge. Among 500 attributes, only 20 are informative, the rest are noise. That’s the reason why high mtry is good here: you have to consider a lot of features to find a meaningful one. The dataset consists of train, validation and test parts, with labels being available for train and validation. We will further split the training set into our train and validation sets, and use the original validation set as a test set to evaluate the final results of parameter tuning. As an error measure we use Area Under Curve, or AUC, which was
3 0.98641056 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision
Introduction: Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. They will evaluate the results using the mean average precision, or MAP, metric. Specifically MAP@x - this means they ask us to recommend x items for each user. So what is this MAP? First, we will get M out of the way. MAP is just an average of APs, or average precision, for all users. In other words, we take the mean for Average Precision, hence Mean Average Precision. If we have 1000 users, we sum APs for each user and divide the sum by 1000. This is MAP (a small computation sketch follows after this list). So now, what is AP, or average precision? It may be that we don’t really need to know. But we probably need to know this: we can recommend at most x items for each user; it pays to submit all x recommendations, because we are not penalized for bad guesses; order matters, so it’s better to submit more certain recommendations fi
4 0.96299994 13 fast ml-2012-12-27-Spearmint with a random forest
Introduction: Now that we have Spearmint basics nailed, we’ll try tuning a random forest, and specifically two hyperparams: a number of trees (ntrees) and a number of candidate features at each split (mtry). Here’s some code. We’re going to use a red wine quality dataset. It has about 1600 examples and our goal will be to predict a rating for a wine given all the other properties. This is a regression* task, as ratings are in the (0,10) range. We will split the data 80/10/10 into train, validation and test sets, and use the first two to establish optimal hyperparams and then predict on the test set. As an error measure we will use RMSE (see the sketch after this list). At first, we will try ntrees between 10 and 200 and mtry between 3 and 11 (there are eleven features total, so that’s the upper bound). Here are the results of two Spearmint runs with 71 and 95 tries respectively. Colors denote a validation error value: green: RMSE < 0.57, blue: RMSE < 0.58, black: RMSE >= 0.58. Turns out that some diffe
5 0.87628585 43 fast ml-2013-11-02-Maxing out the digits
Introduction: Recently we’ve been investigating the basics of Pylearn2. Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. Maxout explained If you’ve been following developments in deep learning, you know that Hinton’s most recent recommendation for supervised learning, after a few years of bashing backpropagation in favour of unsupervised pretraining, is to use classic multilayer perceptrons with dropout and rectified linear units. For us, this breath of simplicity is a welcome change. Rectified linear is f(x) = max(0, x). This makes backpropagation trivial: for x > 0, the derivative is one, else zero. Note that ReLU consists of two linear functions. But why stop at two? Let’s take the max out of three, or four, or five linear functions… And so maxout is a generalization of ReLU. It can approximate any convex function (see the small example after this list). Now backpropagation is easy and dropout prevents overfitting, so we can train a deep
6 0.85691702 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
7 0.85657132 17 fast ml-2013-01-14-Feature selection in practice
8 0.85063875 18 fast ml-2013-01-17-A very fast denoising autoencoder
9 0.84189618 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
10 0.80123597 19 fast ml-2013-02-07-The secret of the big guys
11 0.79754853 35 fast ml-2013-08-12-Accelerometer Biometric Competition
12 0.78563398 20 fast ml-2013-02-18-Predicting advertised salaries
13 0.77817148 9 fast ml-2012-10-25-So you want to work for Facebook
14 0.77667564 40 fast ml-2013-10-06-Pylearn2 in practice
15 0.73364925 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
16 0.73145157 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
17 0.71757936 26 fast ml-2013-04-17-Regression as classification
18 0.71729898 61 fast ml-2014-05-08-Impute missing values with Amelia
19 0.70617217 25 fast ml-2013-04-10-Gender discrimination
20 0.70266867 8 fast ml-2012-10-15-Merck challenge
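The mean average precision entry above (entry 3) defines the metric in words; here is a small, self-contained computation of AP@x and MAP@x matching that description (order matters, wrong guesses are not penalized, per-user scores are averaged). The function names and the min(number of relevant items, x) denominator are our choices for the common Kaggle-style definition, not code from the post.

def average_precision_at_x(recommended, relevant, x):
    # AP@x for one user: average precision@k over the positions k where a hit occurs.
    recommended = recommended[:x]        # at most x recommendations count
    hits, score = 0, 0.0
    for k, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / k            # precision at this cut-off
    denom = min(len(relevant), x)        # so a short but perfect list can still score 1.0
    return score / denom if denom else 0.0

def mean_average_precision_at_x(all_recommended, all_relevant, x):
    # MAP@x: just the mean of the per-user AP@x values.
    aps = [average_precision_at_x(rec, rel, x)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Two users, x = 3 recommendations each.
print(mean_average_precision_at_x(
    all_recommended=[['a', 'b', 'c'], ['d', 'e', 'f']],
    all_relevant=[{'a', 'c'}, {'f'}],
    x=3))                                # about 0.583: (0.833 + 0.333) / 2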
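The Spearmint-with-a-random-forest entry above (entry 4) tunes ntrees and mtry against validation RMSE. The post drives an R random forest with Spearmint; the loop below is only a rough scikit-learn analogue, with n_estimators and max_features standing in for ntrees and mtry, and random data standing in for the wine dataset.

# Vary tree count and candidate features per split, score by validation RMSE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1600, 11)), rng.standard_normal(1600)   # stand-in for the wine data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for n_trees in (10, 50, 100, 200):
    for mtry in (3, 7, 11):                                          # max_features plays the role of mtry
        model = RandomForestRegressor(n_estimators=n_trees, max_features=mtry,
                                      random_state=0).fit(X_train, y_train)
        rmse = np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2))
        print(f'ntrees={n_trees:3d}  mtry={mtry:2d}  RMSE={rmse:.3f}')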
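Finally, the maxout entry above (entry 5) describes ReLU as max(0, x) and maxout as the max over several linear functions of the input. A tiny NumPy illustration of both, with shapes and names of our own choosing:

import numpy as np

def relu(z):
    # Rectified linear: elementwise max(0, z), i.e. the max of two linear pieces (0 and z).
    return np.maximum(0.0, z)

def maxout(x, W, b):
    # Maxout unit: the max over k linear functions of the input.
    # W: (k, n_inputs), b: (k,). With k = 2 and one piece fixed at zero this reduces to ReLU;
    # with larger k it can approximate any convex function.
    return np.max(W @ x + b, axis=0)

x = np.array([1.0, -2.0, 0.5])
W = np.random.default_rng(0).standard_normal((4, 3))   # k = 4 linear pieces, 3 inputs
b = np.zeros(4)
print(relu(W @ x + b))    # four ReLU outputs, one per row of W
print(maxout(x, W, b))    # one maxout output: the largest of the four linear responses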