fast_ml fast_ml-2013 fast_ml-2013-19 knowledge-graph by maker-knowledge-mining

19 fast ml-2013-02-07-The secret of the big guys


meta info for this blog

Source: html

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recogni


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. [sent-10, score-0.978]

2 And the idea is that when you map your features to a new space using K-means clustering, and learn a linear model on those new mapped features, you get pretty good results. [sent-13, score-0.774]

3 It means that when a distance is zero, a feature value is one. [sent-16, score-0.252]

4 (…) 0.01:10; y = exp( gamma * -x ); plot( x, y ). When you set gamma to a lower value, say 0.5, the function will go to zero slower. [sent-20, score-0.712]

5 The trick is to choose gamma so that the feature values are not all close to zero or all close to one (see the RBF sketch after this list). [sent-21, score-1.102]

6 We are tempted to think of cluster centers as something like support vectors. [sent-24, score-0.469]

7 It is important to use many clusters (think thousands), so the resulting representation becomes that many dimensional. [sent-26, score-0.245]

8 It seems that a more sophisticated model, like a random forest, can also be trained on mapped data, but it will be much slower. [sent-28, score-0.421]

9 If we use 1000 centers, the mapping will need roughly 500k * 1000 * 10 bytes, that is 5GB. [sent-32, score-0.243]

10 One solution would be to modify Sofia’s mapping code so that values close to zero become zeros and are therefore absent from the mapped file (it’s in libsvm format). [sent-34, score-1.183]

11 (…) 0.01 cuts the file sizes in half, and the results achieved in terms of an error metric are the same, if not slightly better (see the sparsification sketch after this list). [sent-36, score-0.261]

12 For example, consider mapping to 1000 clusters vs 20 clusters: a 50 times difference. [sent-41, score-0.434]

13 In practice, here’s the procedure: run Sofia K-means to find cluster centers, map the data to these centers using RBF, and learn a linear model on the mapped data. Simple as that (see the pipeline sketch after this list). [sent-42, score-1.423]

14 There are two hyperparams to optimize: the number of centers for K-means and gamma, called cluster_mapping_param in Sofia (see the tuning sketch after this list). [sent-43, score-0.719]

15 The latter is needed so that feature values will be nicely spread between zero and one, instead of being all close to zero or all close to one. [sent-44, score-0.746]

16 Here it’s important to note that we specify a training file in libsvm format, and an output file for the model. [sent-50, score-0.37]

17 (…) 0.01. In goes a model and a file to map; out comes a mapped file (still in libsvm format). [sent-60, score-0.759]

18 At this point, you can use software of your choice to build a linear model from the mapped file (see the loading sketch after this list). [sent-61, score-0.536]

19 For Madelon, the AUC score is slightly better than from a random forest, “slightly” meaning 40 places on the leaderboard. [sent-69, score-0.294]

20 Similar story with Madelon, only more settings tried out, and the best gamma is roughly 0.(…) [sent-75, score-0.41]
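
To make the gamma discussion above (sentences 3-5) concrete, here’s a minimal NumPy sketch of the RBF mapping, value = exp( -gamma * distance ). It’s an illustration only, not Sofia-ML’s code: at distance zero the mapped value is one, and a smaller gamma makes the decay toward zero slower. The gamma values below are arbitrary examples.

```python
# RBF sketch: value = exp(-gamma * distance), illustration only (not Sofia-ML's code).
# At distance zero the value is one; smaller gamma means a slower decay toward zero.
import numpy as np

distances = np.arange(0.0, 10.0, 0.01)   # distances to a cluster center
for gamma in (0.1, 0.5, 2.0):            # arbitrary example values
    values = np.exp(-gamma * distances)
    print("gamma=%.1f  min=%.4f  mean=%.4f" % (gamma, values.min(), values.mean()))
```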
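
The size estimate and the near-zero cutoff (sentences 9-11) can be sketched like this. It’s plain Python, not a modification of Sofia’s mapping code, and the file names are hypothetical: values below the threshold are simply dropped, so they never appear in the libsvm-format file.

```python
# Sparsification sketch (plain Python, not Sofia's mapping code; file names are hypothetical).
# Back-of-the-envelope size: roughly 10 bytes per "index:value" pair on disk.
print(500_000 * 1_000 * 10 / 1e9, "GB")   # -> 5.0 GB for 500k rows x 1000 centers

THRESHOLD = 0.01  # the cutoff mentioned in sentence 11 above

def sparsify_line(line, threshold=THRESHOLD):
    """Keep the label, drop index:value pairs whose value is below the threshold."""
    label, *pairs = line.split()
    kept = [p for p in pairs if float(p.split(":")[1]) >= threshold]
    return " ".join([label] + kept)

with open("train_mapped.libsvm") as src, open("train_mapped_sparse.libsvm", "w") as dst:
    for line in src:
        if line.strip():
            dst.write(sparsify_line(line.strip()) + "\n")
```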
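
The three-step recipe itself (sentence 13) can be sketched with scikit-learn standing in for Sofia. This is an illustrative analogue on synthetic data, not the actual Sofia-ML commands from the post: mini-batch K-means finds the cluster centers, a hand-rolled RBF map projects each example onto those centers, and a linear model is trained on the mapped features.

```python
# Pipeline sketch: a scikit-learn analogue of the three-step recipe on synthetic data
# (not the Sofia-ML commands used in the post).
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances, roc_auc_score
from sklearn.model_selection import train_test_split

def rbf_map(X, centers, gamma):
    """Map each example to exp(-gamma * distance) for every cluster center."""
    return np.exp(-gamma * pairwise_distances(X, centers))

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_centers, gamma = 200, 0.1  # the two hyperparameters; real data wants thousands of centers

kmeans = MiniBatchKMeans(n_clusters=n_centers, random_state=0).fit(X_train)  # step 1: find centers
Z_train = rbf_map(X_train, kmeans.cluster_centers_, gamma)                   # step 2: RBF mapping
Z_test = rbf_map(X_test, kmeans.cluster_centers_, gamma)

linear = LogisticRegression(max_iter=1000).fit(Z_train, y_train)             # step 3: linear model
print("AUC: %.3f" % roc_auc_score(y_test, linear.predict_proba(Z_test)[:, 1]))
```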
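
And a tuning sketch for the two hyperparameters (sentences 14-15), again on synthetic data with an arbitrary grid: for each candidate number of centers and gamma, map the data, check that the mapped values are not all crowded near zero or one, and cross-validate a linear model.

```python
# Tuning sketch: arbitrary grid over (number of centers, gamma) on synthetic data.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)

for n_centers in (50, 100, 200):
    centers = MiniBatchKMeans(n_clusters=n_centers, random_state=0).fit(X).cluster_centers_
    for gamma in (0.01, 0.1, 0.5):
        Z = np.exp(-gamma * pairwise_distances(X, centers))
        auc = cross_val_score(LogisticRegression(max_iter=1000), Z, y,
                              scoring="roc_auc").mean()
        # Mean mapped value far from 0 or 1 suggests gamma spreads the features nicely.
        print("centers=%d gamma=%.2f  mean mapped value=%.2f  CV AUC=%.3f"
              % (n_centers, gamma, Z.mean(), auc))
```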
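
Finally, once you have a mapped file in libsvm format (sentences 16-18), any linear learner will do. Here’s one possible loading sketch with scikit-learn; the file name is hypothetical.

```python
# Loading sketch: build a linear model from the mapped libsvm-format file
# (the file name is hypothetical; any linear learner would do here).
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression

Z, y = load_svmlight_file("train_mapped.libsvm")     # sparse matrix + labels
model = LogisticRegression(max_iter=1000).fit(Z, y)
```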


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('mapped', 0.367), ('gamma', 0.356), ('centers', 0.306), ('clusters', 0.245), ('mapping', 0.189), ('sofia', 0.184), ('cluster', 0.163), ('close', 0.162), ('rbf', 0.122), ('clustering', 0.122), ('zero', 0.121), ('map', 0.112), ('larger', 0.112), ('file', 0.103), ('libsvm', 0.103), ('feature', 0.096), ('slightly', 0.093), ('distance', 0.09), ('learner', 0.09), ('response', 0.09), ('linear', 0.086), ('values', 0.084), ('model', 0.083), ('madelon', 0.081), ('tuning', 0.075), ('forest', 0.07), ('method', 0.069), ('value', 0.066), ('results', 0.065), ('paper', 0.063), ('space', 0.061), ('specify', 0.061), ('hyperparams', 0.057), ('roughly', 0.054), ('meaning', 0.054), ('solution', 0.054), ('random', 0.054), ('format', 0.053), ('applicable', 0.051), ('disk', 0.051), ('exp', 0.051), ('radial', 0.051), ('clue', 0.051), ('compensate', 0.051), ('ability', 0.051), ('amounts', 0.051), ('cover', 0.051), ('curse', 0.051), ('extremely', 0.051), ('leaving', 0.051)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000005 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit , a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google’s guy response) seems to be Sofia-ML . The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning ( PDF ). The paper, by Coates , Lee and Ng, is about object recogni

2 0.18929356 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

3 0.1330336 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

Introduction: The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot , used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525. We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava . To their credit, Charlie has won the two remaining Challenges in Representation Learning . Not-so-deep learning The difference to our previous, beating-the-benchmark attempt is twofold: one layer instead of two; for supervised learning, VW instead of a random forest. Somewhat surprisingly, one layer works better than two. Even more surprisingly, with enough units you can get 0.634 using a linear model (Vowpal Wabbit, of course, One-Against-All). In our understanding, that’s the point of overcomplete representations*, which Stanford people seem to care much about. Recall The secret of the big guys and the pape

4 0.12113781 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge , which is about predicting a status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit , and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

5 0.11622703 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

Introduction: How to represent features for machine learning is an important business. For example, deep learning is all about finding good representations. What exactly they are depends on a task at hand. We investigate how to use available labels to obtain good representations. Motivation The paper that inspired us a while ago was Nonparametric Guidance of Autoencoder Representations using Label Information by Snoek, Adams and LaRochelle. It’s about autoencoders, but contains a greater idea: Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. (…) However, pure unsupervised learning (…) can find representations that may or may not be useful for the ultimate discriminative task. (…) In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. That is, we wish to find ways to extract statistical structu

6 0.10925149 27 fast ml-2013-05-01-Deep learning made easy

7 0.10598504 20 fast ml-2013-02-18-Predicting advertised salaries

8 0.10172936 33 fast ml-2013-07-09-Introducing phraug

9 0.098353028 13 fast ml-2012-12-27-Spearmint with a random forest

10 0.0959545 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

11 0.094606578 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

12 0.091429599 25 fast ml-2013-04-10-Gender discrimination

13 0.090968668 22 fast ml-2013-03-07-Choosing a machine learning algorithm

14 0.090738975 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

15 0.090366691 41 fast ml-2013-10-09-Big data made easy

16 0.090308689 18 fast ml-2013-01-17-A very fast denoising autoencoder

17 0.089675061 43 fast ml-2013-11-02-Maxing out the digits

18 0.084932178 32 fast ml-2013-07-05-Processing large files, line by line

19 0.080872275 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

20 0.078551278 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.355), (1, -0.005), (2, -0.046), (3, -0.025), (4, 0.081), (5, 0.02), (6, -0.027), (7, -0.006), (8, 0.032), (9, -0.139), (10, 0.018), (11, -0.063), (12, 0.1), (13, -0.187), (14, 0.053), (15, -0.157), (16, 0.208), (17, 0.098), (18, 0.135), (19, -0.015), (20, -0.152), (21, -0.057), (22, 0.014), (23, 0.063), (24, -0.273), (25, 0.224), (26, 0.096), (27, -0.111), (28, 0.202), (29, -0.048), (30, 0.162), (31, -0.041), (32, 0.029), (33, -0.326), (34, -0.088), (35, 0.079), (36, -0.137), (37, -0.252), (38, 0.045), (39, 0.021), (40, 0.081), (41, 0.063), (42, -0.133), (43, -0.029), (44, -0.034), (45, 0.055), (46, 0.265), (47, -0.104), (48, -0.056), (49, 0.091)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.96235281 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit , a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google’s guy response) seems to be Sofia-ML . The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning ( PDF ). The paper, by Coates , Lee and Ng, is about object recogni

2 0.49158475 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

3 0.32340592 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

Introduction: The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot , used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525. We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava . To their credit, Charlie has won the two remaining Challenges in Representation Learning . Not-so-deep learning The difference to our previous, beating-the-benchmark attempt is twofold: one layer instead of two; for supervised learning, VW instead of a random forest. Somewhat surprisingly, one layer works better than two. Even more surprisingly, with enough units you can get 0.634 using a linear model (Vowpal Wabbit, of course, One-Against-All). In our understanding, that’s the point of overcomplete representations*, which Stanford people seem to care much about. Recall The secret of the big guys and the pape

4 0.26028877 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge , which is about predicting a status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit , and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

5 0.25321221 43 fast ml-2013-11-02-Maxing out the digits

Introduction: Recently we’ve been investigating the basics of Pylearn2 . Now it’s time for a more advanced example: a multilayer perceptron with dropout and maxout activation for the MNIST digits. Maxout explained If you’ve been following developments in deep learning, you know that Hinton’s most recent recommendation for supervised learning, after a few years of bashing backpropagation in favour of unsupervised pretraining, is to use classic multilayer perceptrons with dropout and rectified linear units. For us, this breath of simplicity is a welcome change. Rectified linear is f(x) = max( 0, x ) . This makes backpropagation trivial: for x > 0, the derivative is one, else zero. Note that ReLU consists of two linear functions. But why stop at two? Let’s take max. out of three, or four, or five linear functions… And so maxout is a generalization of ReLU. It can approximate any convex function. Now backpropagation is easy and dropout prevents overfitting, so we can train a deep

6 0.22834684 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

7 0.22214702 20 fast ml-2013-02-18-Predicting advertised salaries

8 0.21828017 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

9 0.20909537 22 fast ml-2013-03-07-Choosing a machine learning algorithm

10 0.20429824 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

11 0.20267174 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

12 0.2005294 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

13 0.19882521 27 fast ml-2013-05-01-Deep learning made easy

14 0.19826867 33 fast ml-2013-07-09-Introducing phraug

15 0.19806932 25 fast ml-2013-04-10-Gender discrimination

16 0.1925156 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

17 0.19093497 41 fast ml-2013-10-09-Big data made easy

18 0.17807838 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

19 0.16281736 13 fast ml-2012-12-27-Spearmint with a random forest

20 0.16043064 18 fast ml-2013-01-17-A very fast denoising autoencoder


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(26, 0.042), (31, 0.091), (35, 0.03), (55, 0.042), (58, 0.021), (69, 0.177), (71, 0.062), (78, 0.382), (99, 0.053)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96704096 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit . We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning, including denoising auto-encoders, contracting auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v

same-blog 2 0.89274418 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit , a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google’s guy response) seems to be Sofia-ML . The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning ( PDF ). The paper, by Coates , Lee and Ng, is about object recogni

3 0.51957744 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

Introduction: How to represent features for machine learning is an important business. For example, deep learning is all about finding good representations. What exactly they are depends on a task at hand. We investigate how to use available labels to obtain good representations. Motivation The paper that inspired us a while ago was Nonparametric Guidance of Autoencoder Representations using Label Information by Snoek, Adams and LaRochelle. It’s about autoencoders, but contains a greater idea: Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. (…) However, pure unsupervised learning (…) can find representations that may or may not be useful for the ultimate discriminative task. (…) In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. That is, we wish to find ways to extract statistical structu

4 0.49142635 27 fast ml-2013-05-01-Deep learning made easy

Introduction: As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal. There are a couple benchmarks for this competition and the best one is unusually hard to beat 1 - only less than a fourth of those taking part managed to do so. We’re among them. Here’s how. The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning: sparse filtering by Jiquan Ngiam et al. Actually, it’s not secret. It’s available at Github , and has one or two very appealing properties. Let us explain. The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for computers 2 . Geoff Hinton from Toronto talks about two ends

5 0.48992673 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

Introduction: The job salary prediction contest at Kaggle offers a highly-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples. We present a way to select a few thousand relevant features using L1 (Lasso) regularization. A linear model seems to work just as well with those selected features as with the full set. This means we get roughly 40 times less features for a much more manageable, smaller data set. What you wanted to know about Lasso and Ridge L1 and L2 are both ways of regularization sometimes called weight decay . Basically, we include parameter weights in a cost function. In effect, the model will try to minimize those weights by going “down the slope”. Example weights: in a linear model or in a neural network. L1 is known as Lasso and L2 is known as Ridge. These names may be confusing, because a chart of Lasso looks like a ridge and a

6 0.48940507 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

7 0.48890939 58 fast ml-2014-04-12-Deep learning these days

8 0.48647353 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

9 0.48643762 18 fast ml-2013-01-17-A very fast denoising autoencoder

10 0.48023808 40 fast ml-2013-10-06-Pylearn2 in practice

11 0.47774145 43 fast ml-2013-11-02-Maxing out the digits

12 0.4689846 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

13 0.45659262 13 fast ml-2012-12-27-Spearmint with a random forest

14 0.45503184 9 fast ml-2012-10-25-So you want to work for Facebook

15 0.4531422 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

16 0.44915596 17 fast ml-2013-01-14-Feature selection in practice

17 0.44805318 16 fast ml-2013-01-12-Intro to random forests

18 0.44732913 26 fast ml-2013-04-17-Regression as classification

19 0.43562675 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python

20 0.43020713 61 fast ml-2014-05-08-Impute missing values with Amelia