fast_ml fast_ml-2014 fast_ml-2014-55 knowledge-graph by maker-knowledge-mining

55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction


meta info for this blog

Source: html

Introduction: How to represent features for machine learning is an important business. For example, deep learning is all about finding good representations. What exactly they are depends on the task at hand. We investigate how to use available labels to obtain good representations. Motivation. The paper that inspired us a while ago was Nonparametric Guidance of Autoencoder Representations using Label Information by Snoek, Adams and Larochelle. It’s about autoencoders, but contains a greater idea: Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. (…) However, pure unsupervised learning (…) can find representations that may or may not be useful for the ultimate discriminative task. (…) In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. That is, we wish to find ways to extract statistical structu


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 (…) However, pure unsupervised learning (…) can find representations that may or may not be useful for the ultimate discriminative task. [sent-7, score-0.526]

2 (…) In this work, we are interested in the discovery of latent features which can be later used as alternate representations of data for discriminative tasks. [sent-8, score-0.41]

3 This is undesirable because it potentially prevents us from discovering informative representations for the more sophisticated nonlinear classifiers that we might wish to use later. [sent-14, score-0.298]

4 In practice this may or may not be a problem - Paul Mineiro thinks that features good for linear classifiers are good for non-linear ones too: engineer features that are good for your linear model, and then when you run out of steam, try to add a few hidden units. [sent-15, score-0.484]

5 In this paper we investigate scalable techniques for inducing discriminative features by taking advantage of simple second order structure in the data. [sent-17, score-0.486]

6 Distance. Many machine learning methods rely on some measure of distance between points, usually Euclidean distance. [sent-21, score-0.461]

7 Notable examples are nearest neighbours and kernel methods. [sent-22, score-0.457]

8 Euclidean distance depends on the scale of each feature. [sent-23, score-0.379]

9 When scaling, we multiply the design matrix X by a diagonal matrix. [sent-29, score-1.034]

10 In the special case when that diagonal matrix is the identity matrix, we get back X: X * I = X. The entries on the main diagonal are the factors by which to scale each column. [sent-30, score-1.131]

11 This illustrates the fact that there may be better options than scaling each axis to a unit standard deviation, at least when using methods concerned with distance. [sent-32, score-0.412]

12 Also, what if we extended the idea of scaling by adding some nonzero off-diagonal entries to the transformation matrix? [sent-33, score-0.418]

13 Metric learning. For the purpose of this article, we’ll define metric learning as learning a matrix M with which we can linearly transform the design matrix as a whole. [sent-34, score-1.265]

14 If M is square, we can use it to transform the distance between two points. [sent-35, score-0.316]

15 The idea is to warp it so that the points with the same (or similar - in regression) labels get closer together and those with different labels get farther apart. [sent-36, score-0.465]

16 We can warp the Euclidean distance by sticking M in the middle of the inner product: (x1 - x2) * M * (x1 - x2)'. The difference x1 - x2 is a [1xD] row vector, so the dimensions work out to [1xD] * [DxD] * [Dx1] = [1x1]; the [Dx1] factor is just x1 - x2 transposed. [sent-37, score-0.353]

17 We can also multiply the original matrix of features by M. [sent-38, score-0.521]

18 The resulting matrix will have the same number of rows. [sent-39, score-0.31]

19 If M is square, the dimensionality of the transformed design matrix stays the same: [NxD] * [DxD] = [NxD]. If M is not square, we achieve either supervised dimensionality reduction or get an overcomplete representation: [NxD] * [Dx?] = [Nx?]. [sent-44, score-0.521]

20 If you have a big dataset, however, you could select a smaller representative subset of points (for example by clustering) to learn a transformation matrix, then apply the transformation to all the points. [sent-53, score-0.33]
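To make sentence 8 concrete, here is a minimal numpy sketch with made-up numbers: after standardizing the columns, a different point becomes the nearest neighbour of the query, so any distance-based method will behave differently before and after scaling.

import numpy as np

# two points and a query, with the second feature on a much larger scale (made-up numbers)
X = np.array([[1.0, 3000.0],    # point 0: close to the query in feature 1
              [10.0, 1000.0]])  # point 1: close to the query in feature 2
q = np.array([0.0, 1000.0])

def nearest(X, q):
    return np.argmin(np.linalg.norm(X - q, axis=1))

print(nearest(X, q))          # 1 - the large-scale feature dominates the raw distance
s = X.std(axis=0)             # per-column standard deviations
print(nearest(X / s, q / s))  # 0 - the nearest neighbour flips after standardizing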
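Sentences 9 and 10 describe scaling as multiplication by a diagonal matrix; a small numpy sketch with hypothetical scaling factors:

import numpy as np

X = np.random.randn(5, 3)                  # design matrix: N=5 points, D=3 features
factors = np.array([0.5, 2.0, 10.0])       # per-column scaling factors (made up)

S = np.diag(factors)                       # diagonal transformation matrix
X_scaled = X @ S                           # column j gets multiplied by factors[j]

assert np.allclose(X_scaled, X * factors)  # same as elementwise column scaling
assert np.allclose(X @ np.eye(3), X)       # X * I = X, the identity special case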
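Sentence 16’s warped distance can be computed directly in numpy; M below is an arbitrary positive semi-definite matrix chosen for illustration, not a learned metric:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 1.0])
d = x1 - x2                                # the [1xD] difference vector

A = np.array([[1.0, 0.2, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
M = A.T @ A                                # [DxD], positive semi-definite by construction

squared_euclidean = d @ d                  # [1xD] * [Dx1] = [1x1]
squared_warped = d @ M @ d                 # [1xD] * [DxD] * [Dx1] = [1x1]
print(squared_euclidean, squared_warped)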
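Sentence 19’s non-square case amounts to supervised dimensionality reduction. The post’s vocabulary includes LMNN, ITML and MLKR (visible in the tfidf word list below); the sketch here instead uses scikit-learn’s NeighborhoodComponentsAnalysis on the iris data, purely as a readily available stand-in for a method that learns such a transform from labels.

from sklearn.datasets import load_iris
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)          # [NxD] with N=150, D=4

# learn a low-dimensional linear transform from the labels
nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
nca.fit(X, y)

X_2d = nca.transform(X)                    # [NxD] * [Dx2] = [Nx2]
print(nca.components_.shape, X_2d.shape)   # (2, 4) and (150, 2)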
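And for sentence 20, a sketch of learning the transformation on a smaller subset and then applying it to every point; a plain random subsample stands in here for the clustering-based selection the sentence suggests, and the digits data and the same stand-in learner are arbitrary choices.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_digits(return_X_y=True)        # 1797 points, 64 features

rng = np.random.RandomState(0)
subset = rng.choice(len(X), size=500, replace=False)  # smaller representative subset

nca = NeighborhoodComponentsAnalysis(n_components=10, random_state=0)
nca.fit(X[subset], y[subset])              # learn the transformation on the subset only

X_all = nca.transform(X)                   # apply the learned transformation to all points
print(X_all.shape)                         # (1797, 10)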


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('matrix', 0.31), ('distance', 0.243), ('lmnn', 0.221), ('discriminative', 0.184), ('diagonal', 0.166), ('dxd', 0.166), ('euclidian', 0.166), ('mlkr', 0.166), ('nxd', 0.166), ('square', 0.147), ('scaling', 0.146), ('methods', 0.139), ('metric', 0.139), ('neighbours', 0.138), ('design', 0.138), ('representations', 0.125), ('nearest', 0.122), ('dx', 0.11), ('entries', 0.11), ('itml', 0.11), ('multiply', 0.11), ('warp', 0.11), ('transformation', 0.11), ('points', 0.11), ('features', 0.101), ('wish', 0.092), ('paul', 0.092), ('standardizing', 0.092), ('demo', 0.092), ('margin', 0.092), ('investigate', 0.081), ('classifiers', 0.081), ('learning', 0.079), ('stays', 0.073), ('transform', 0.073), ('elements', 0.073), ('scale', 0.069), ('may', 0.069), ('together', 0.067), ('depends', 0.067), ('pca', 0.067), ('labels', 0.063), ('hidden', 0.063), ('structure', 0.063), ('kernel', 0.058), ('axis', 0.058), ('article', 0.058), ('paper', 0.057), ('classifier', 0.055), ('idea', 0.052)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999982 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction


2 0.13904783 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

Introduction: Last time we explored dimensionality reduction in practice using Gensim’s LSI and LDA. Now, having spent some time researching the subject matter, we will give an overview of other options. UPDATE : We now consider the topic quite irrelevant, because sparse high-dimensional data is precisely where linear models shine. See Amazon aspires to automate access control , Predicting advertised salaries and Predicting closed questions on Stack Overflow . And the few most popular methods are: LSI/LSA - a multinomial PCA LDA - Latent Dirichlet Allocation matrix factorization, in particular non-negative variants: NMF ICA, or Independent Components Analysis mixtures of Bernoullis stacked RBMs correlated topic models, an extension of LDA We tried the first two before. As regards matrix factorization, you do the same stuff as with movie recommendations (think Netflix challenge). The difference is, now all the matrix elements are known and we are only interested in

3 0.11860725 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit . We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning, including denoising auto-encoders, contracting auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v

4 0.11622703 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit , a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google’s guy response) seems to be Sofia-ML . The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning ( PDF ). The paper, by Coates , Lee and Ng, is about object recogni

5 0.11550124 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

Introduction: Recently Rob Zinkov published his selection of interesting-looking NIPS papers . Inspired by this, we list some more. Rob seems to like Bayesian stuff, we’re more into neural networks. If you feel like browsing, Andrej Karpathy has a page with all NIPS 2013 papers . They are categorized by topics discovered by running LDA. When you see an interesting paper, you can discover ones ranked similar by TF-IDF. Here’s what we found. Understanding Dropout Pierre Baldi, Peter J. Sadowski Dropout is a relatively new algorithm for training neural networks which relies on stochastically dropping out neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characte

6 0.10789977 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

7 0.097227529 16 fast ml-2013-01-12-Intro to random forests

8 0.09162081 27 fast ml-2013-05-01-Deep learning made easy

9 0.091423213 18 fast ml-2013-01-17-A very fast denoising autoencoder

10 0.088358887 22 fast ml-2013-03-07-Choosing a machine learning algorithm

11 0.08696495 58 fast ml-2014-04-12-Deep learning these days

12 0.086584941 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

13 0.082037494 13 fast ml-2012-12-27-Spearmint with a random forest

14 0.077645607 20 fast ml-2013-02-18-Predicting advertised salaries

15 0.077188008 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

16 0.072410099 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

17 0.066312253 17 fast ml-2013-01-14-Feature selection in practice

18 0.065891765 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

19 0.065416977 41 fast ml-2013-10-09-Big data made easy

20 0.064485945 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.295), (1, 0.114), (2, 0.096), (3, 0.06), (4, 0.113), (5, 0.047), (6, -0.059), (7, -0.219), (8, -0.027), (9, -0.019), (10, 0.032), (11, -0.121), (12, 0.045), (13, -0.101), (14, -0.058), (15, -0.089), (16, 0.191), (17, -0.028), (18, -0.085), (19, 0.091), (20, 0.226), (21, -0.108), (22, -0.164), (23, -0.004), (24, -0.273), (25, 0.109), (26, 0.05), (27, -0.159), (28, 0.098), (29, -0.183), (30, 0.0), (31, -0.031), (32, -0.092), (33, 0.009), (34, -0.161), (35, 0.055), (36, 0.313), (37, 0.381), (38, 0.098), (39, -0.092), (40, -0.028), (41, 0.23), (42, -0.005), (43, 0.104), (44, -0.121), (45, -0.186), (46, -0.159), (47, -0.068), (48, 0.01), (49, -0.064)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97472018 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction


2 0.32636341 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview


3 0.25933364 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA


4 0.25827193 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

Introduction: Out of 215 contestants, we placed 8th in the Cats and Dogs competition at Kaggle. The top ten finish gave us the master badge. The competition was about discerning the animals in images and here’s how we did it. We extracted the features using pre-trained deep convolutional networks, specifically decaf and OverFeat . Then we trained some classifiers on these features. The whole thing was inspired by Kyle Kastner’s decaf + pylearn2 combo and we expanded this idea. The classifiers were linear models from scikit-learn and a neural network from Pylearn2 . At the end we created a voting ensemble of the individual models. OverFeat features We touched on OverFeat in Classifying images with a pre-trained deep network . A better way to use it in this competition’s context is to extract the features from the layer before the classifier, as Pierre Sermanet suggested in the comments. Concretely, in the larger OverFeat model ( -l ) layer 24 is the softmax, at least in the

5 0.22140218 19 fast ml-2013-02-07-The secret of the big guys


6 0.18702772 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

7 0.18152307 13 fast ml-2012-12-27-Spearmint with a random forest

8 0.17799643 27 fast ml-2013-05-01-Deep learning made easy

9 0.17170607 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

10 0.17054738 18 fast ml-2013-01-17-A very fast denoising autoencoder

11 0.17022213 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

12 0.16835904 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

13 0.16681747 58 fast ml-2014-04-12-Deep learning these days

14 0.1575402 15 fast ml-2013-01-07-Machine learning courses online

15 0.15252416 16 fast ml-2013-01-12-Intro to random forests

16 0.14001654 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

17 0.13655721 17 fast ml-2013-01-14-Feature selection in practice

18 0.13250327 42 fast ml-2013-10-28-How much data is enough?

19 0.12836511 22 fast ml-2013-03-07-Choosing a machine learning algorithm

20 0.1281558 20 fast ml-2013-02-18-Predicting advertised salaries


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(6, 0.016), (26, 0.05), (31, 0.046), (35, 0.06), (47, 0.391), (48, 0.016), (55, 0.024), (58, 0.017), (69, 0.123), (71, 0.057), (78, 0.053), (79, 0.022), (99, 0.049)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.85863435 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction


2 0.38582599 19 fast ml-2013-02-07-The secret of the big guys


3 0.35410103 9 fast ml-2012-10-25-So you want to work for Facebook

Introduction: Good news, everyone! There’s a new contest on Kaggle - Facebook is looking for talent . They won’t pay, but just might interview. This post is in a way a bonus for active readers because most visitors of fastml.com originally come from Kaggle forums. For this competition the forums are disabled to encourage own work . To honor this, we won’t publish any code. But own work doesn’t mean original work , and we wouldn’t want to reinvent the wheel, would we? The contest differs substantially from a Kaggle stereotype, if there is such a thing, in three major ways: there’s no money prizes, as mentioned above it’s not a real world problem, but rather an assignment to screen job candidates (this has important consequences, described below) it’s not a typical machine learning project, but rather a broader AI exercise You are given a graph of the internet, actually a snapshot of the graph for each of 15 time steps. You are also given a bunch of paths in this graph, which a

4 0.35037515 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

5 0.34751359 18 fast ml-2013-01-17-A very fast denoising autoencoder

Introduction: Once upon a time we were browsing machine learning papers and software. We were interested in autoencoders and found a rather unusual one. It was called marginalized Stacked Denoising Autoencoder and the author claimed that it preserves the strong feature learning capacity of Stacked Denoising Autoencoders, but is orders of magnitudes faster. We like all things fast, so we were hooked. About autoencoders Wikipedia says that an autoencoder is an artificial neural network and its aim is to learn a compressed representation for a set of data. This means it is being used for dimensionality reduction . In other words, an autoencoder is a neural network meant to replicate the input. It would be trivial with a big enough number of units in a hidden layer: the network would just find an identity mapping. Hence dimensionality reduction: a hidden layer size is typically smaller than input layer. mSDA is a curious specimen: it is not a neural network and it doesn’t reduce dimension

6 0.34330717 16 fast ml-2013-01-12-Intro to random forests

7 0.3414 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

8 0.33767247 40 fast ml-2013-10-06-Pylearn2 in practice

9 0.33600444 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

10 0.33582374 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

11 0.33469868 13 fast ml-2012-12-27-Spearmint with a random forest

12 0.33235747 27 fast ml-2013-05-01-Deep learning made easy

13 0.3307648 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA

14 0.32950291 17 fast ml-2013-01-14-Feature selection in practice

15 0.32721189 43 fast ml-2013-11-02-Maxing out the digits

16 0.32584426 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

17 0.32031569 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

18 0.31749588 49 fast ml-2014-01-10-Classifying images with a pre-trained deep network

19 0.31658384 20 fast ml-2013-02-18-Predicting advertised salaries

20 0.31536022 26 fast ml-2013-04-17-Regression as classification