fast_ml fast_ml-2013 fast_ml-2013-21 knowledge-graph by maker-knowledge-mining

21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data


meta info for this blog

Source: html

Introduction: Much of the data in machine learning is sparse, that is, mostly zeros, and often binary. The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to bag-of-words representation. If each feature is binary - either zero or one - then it holds exactly one bit of information. Surely we could somehow compress such data to fewer real numbers. To do this, we turn to topic models, an area of research with roots in natural language processing. In NLP, a training set is called a corpus, and each document is like a row in the set. A document might be three pages of text, or just a few words, as in a tweet. The idea of topic modelling is that you can group words in your corpus into relatively few topics and represent each document as a mixture of these topics. It’s attractive because you can interpret the model by looking at words that form the topics. Sometimes they seem meaningful, sometimes not. A meaningful topic might be, for example: “cric
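To make this concrete, here is a minimal sketch (not from the original post; the column names and example texts are made up) of how such sparse, mostly-binary data arises from one-hot encoding and a binary bag-of-words:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical categorical column -> one-hot vectors (a 4 x 3 binary matrix)
df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
one_hot = pd.get_dummies(df['color'])

# hypothetical short texts -> binary bag-of-words (a sparse 0/1 matrix)
texts = ['the cat sat', 'the dog sat', 'cat and dog']
bow = CountVectorizer(binary=True).fit_transform(texts)

print(one_hot)
print(bow.toarray())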


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to bag-of-words representation. [sent-2, score-0.234]

2 Surely we could somehow compress such data to fewer real numbers. [sent-4, score-0.212]

3 In NLP, a training set is called a corpus, and each document is like a row in the set. [sent-6, score-0.18]

4 A document might be three pages of text, or just a few words, as in a tweet. [sent-7, score-0.298]

5 The idea of topic modelling is that you can group words in your corpus into relatively few topics and represent each document as a mixture of these topics. [sent-8, score-0.733]

6 A meaningful topic might be, for example: “cricket”, “basketball”, “championships”, “players”, “win” and so on - in short, “sports”. [sent-11, score-0.29]

7 An interesting twist is to take topic methods and apply them to non-text data to compress it from a sparse binary to a dense continuous representation. [sent-12, score-0.537]

8 We think that the whole point is to reduce dimensionality so we can go non-linear, which would be too costly and too prone to overfitting with thousands of binary features. [sent-16, score-0.281]

9 It implements three methods we could use: latent semantic indexing (LSI, or LSA - the A stands for analysis), latent Dirichlet allocation (LDA), and random projections (RP). [sent-18, score-0.376]
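As a minimal sketch of the three transformations in Gensim (the toy corpus and variable names below are ours, not the post's):

from gensim import corpora, models

# toy corpus: each document is a list of tokens
docs = [['a', 'b', 'c'], ['b', 'c', 'd'], ['a', 'd']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # LSI / LSA
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # LDA
rp = models.RpModel(corpus, num_topics=2)                        # random projections

print(lsi[corpus[0]])  # document 0 expressed in the reduced space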

10 We think of LSI as PCA for binary or multinomial data. [sent-19, score-0.21]

11 A nice thing about Gensim is that it implements online versions of each technique, that is, we don’t need to load the whole set into memory. [sent-23, score-0.189]
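A sketch of that streaming pattern, assuming a hypothetical file train.txt with one whitespace-tokenized document per line (the class name is ours):

from gensim import corpora, models

class StreamedCorpus(object):
    # yields one bag-of-words vector at a time, so the whole set never sits in memory
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary
    def __iter__(self):
        for line in open(self.path):
            yield self.dictionary.doc2bow(line.split())

dictionary = corpora.Dictionary(line.split() for line in open('train.txt'))
corpus = StreamedCorpus('train.txt', dictionary)
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=10)  # trained by streaming over the corpus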

12 TF-IDF stands for term frequency - inverse document frequency; basically, it is a fast pre-processing step used with LSI and RP, but not with LDA. [sent-29, score-0.485]
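A sketch of that ordering, assuming the corpus and dictionary from the sketches above: TF-IDF re-weights the raw counts first, LSI and RP are trained on the weighted vectors, and LDA consumes the raw counts directly.

from gensim import models

tfidf = models.TfidfModel(corpus)  # fit IDF weights on the bag-of-words corpus
corpus_tfidf = tfidf[corpus]       # lazily re-weighted view of the corpus

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
rp = models.RpModel(corpus_tfidf, num_topics=10)
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)  # raw counts, no TF-IDF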

13 Some of them come from discretizing real values, so it’s not an ideal testing scenario, because in effect the conversion goes partly from real to binary and then back to real. [sent-43, score-0.612]

14 It gets 85% accuracy and, by the way, it seems that you can get a similar number with a linear model, so again - this set is not ideal for drawing conclusions. [sent-46, score-0.171]

15 Then we apply LSI, LDA and RP transformations with 10 topics and train a random forest on the new data. [sent-47, score-0.284]
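A sketch of that step, assuming the 10-topic models and corpus from the sketches above and a hypothetical labels.txt with one label per line: the transformed corpus is densified into one row per document and handed to scikit-learn's random forest.

import numpy as np
from gensim import matutils
from sklearn.ensemble import RandomForestClassifier

# corpus2dense returns a (num_topics x num_docs) matrix, so transpose it
# to get the usual one-row-per-document layout
X = matutils.corpus2dense(lsi[corpus], num_terms=10).T
y = np.loadtxt('labels.txt')  # hypothetical file, one label per line

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)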

16 Basically: the accuracy after conversion is only slightly worse (83%); LSI with TF-IDF and LDA are equally good; RP is worse (80%); LDA with TF-IDF is worse still (78%), so don’t do it; LDA seems to produce a sparser representation: -1 1:0. [sent-49, score-0.488]
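For reference, a sketch of dumping the transformed vectors in the libsvm format quoted above (the labels and file name are hypothetical; lda and corpus come from the sketches above). Each Gensim vector is already a sparse list of (topic, weight) pairs, which maps directly onto libsvm's index:value pairs:

labels = [-1, 1, -1]  # hypothetical labels, one per document
# topics are 0-indexed in Gensim, so add 1 to get 1-based libsvm indices
with open('train_lda.libsvm', 'w') as out:
    for label, doc in zip(labels, lda[corpus]):
        pairs = ' '.join('%d:%g' % (topic + 1, weight) for topic, weight in doc)
        out.write('%s %s\n' % (label, pairs))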

17 We also tried LSI with 20 topics instead of 10 and the score is the same. [sent-61, score-0.23]

18 That might suggest that the inherent dimensionality of this set is at most 10. [sent-62, score-0.243]

19 There is a way to look closer at this, similar to inspecting the variance of principal components in PCA: >>> lsi = models. [sent-63, score-0.508]

20 When you plot s, you notice two “elbows”, at topics = 2 and topics = 6. [sent-86, score-0.46]
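A sketch of that inspection, assuming corpus_tfidf and dictionary from the sketches above; in Gensim the singular values live in lsi.projection.s and play the role that explained variance plays in PCA:

import matplotlib.pyplot as plt
from gensim import models

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=20)
s = lsi.projection.s  # singular values, one per topic, in decreasing order
print(s)

plt.plot(range(1, len(s) + 1), s, marker='o')
plt.xlabel('topic')
plt.ylabel('singular value')
plt.show()  # look for "elbows" in the curve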


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('lsi', 0.394), ('rp', 0.338), ('lda', 0.338), ('topics', 0.23), ('gensim', 0.225), ('document', 0.18), ('binary', 0.154), ('topic', 0.143), ('compress', 0.135), ('implements', 0.135), ('sparser', 0.135), ('ideal', 0.113), ('corpus', 0.113), ('inherent', 0.113), ('frequency', 0.099), ('converting', 0.09), ('meaningful', 0.09), ('latent', 0.09), ('worse', 0.09), ('libsvm', 0.085), ('pca', 0.083), ('conversion', 0.083), ('real', 0.077), ('dimensionality', 0.073), ('csv', 0.067), ('words', 0.067), ('sometimes', 0.064), ('three', 0.061), ('help', 0.06), ('similiar', 0.058), ('might', 0.057), ('testing', 0.057), ('directly', 0.056), ('inspecting', 0.056), ('levels', 0.056), ('allocation', 0.056), ('arbitrarily', 0.056), ('formats', 0.056), ('multinomial', 0.056), ('none', 0.056), ('sports', 0.056), ('stands', 0.056), ('win', 0.056), ('apply', 0.054), ('whole', 0.054), ('text', 0.054), ('basically', 0.051), ('sparse', 0.051), ('back', 0.051), ('labels', 0.051)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data


2 0.35186306 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

Introduction: Last time we explored dimensionality reduction in practice using Gensim’s LSI and LDA. Now, having spent some time researching the subject matter, we will give an overview of other options. UPDATE : We now consider the topic quite irrelevant, because sparse high-dimensional data is precisely where linear models shine. See Amazon aspires to automate access control , Predicting advertised salaries and Predicting closed questions on Stack Overflow . And the few most popular methods are: LSI/LSA - a multinomial PCA LDA - Latent Dirichlet Allocation matrix factorization, in particular non-negative variants: NMF ICA, or Independent Components Analysis mixtures of Bernoullis stacked RBMs correlated topic models, an extension of LDA We tried the first two before. As regards matrix factorization, you do the same stuff as with movie recommendations (think Netflix challenge). The difference is, now all the matrix elements are known and we are only interested in

3 0.10523096 33 fast ml-2013-07-09-Introducing phraug

Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. With phraug you currently can convert from one format to another: csv to libsvm csv to Vowpal Wabbit libsvm to csv libsvm to Vowpal Wabbit tsv to csv And perform some other file operations: count lines in a file sample lines from a file split a file into two randomly split a file into a number of similarly sized chunks save a continuous subset of lines from a file (for example, first 100) delete specified columns from a csv file normalize (shift and scale) columns in a csv file Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily ach

4 0.089898385 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

5 0.088751696 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

Introduction: Recently Rob Zinkov published his selection of interesting-looking NIPS papers. Inspired by this, we list some more. Rob seems to like Bayesian stuff; we’re more into neural networks. If you feel like browsing, Andrej Karpathy has a page with all NIPS 2013 papers. They are categorized by topics discovered by running LDA. When you see an interesting paper, you can discover ones ranked similar by TF-IDF. Here’s what we found. Understanding Dropout Pierre Baldi, Peter J. Sadowski Dropout is a relatively new algorithm for training neural networks which relies on stochastically dropping out neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characte

6 0.087846957 42 fast ml-2013-10-28-How much data is enough?

7 0.078132339 19 fast ml-2013-02-07-The secret of the big guys

8 0.076533645 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

9 0.074310973 32 fast ml-2013-07-05-Processing large files, line by line

10 0.074277297 22 fast ml-2013-03-07-Choosing a machine learning algorithm

11 0.073803522 16 fast ml-2013-01-12-Intro to random forests

12 0.070440635 30 fast ml-2013-06-01-Amazon aspires to automate access control

13 0.068714879 41 fast ml-2013-10-09-Big data made easy

14 0.068434611 20 fast ml-2013-02-18-Predicting advertised salaries

15 0.066222779 40 fast ml-2013-10-06-Pylearn2 in practice

16 0.065891765 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

17 0.064102456 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

18 0.063163444 17 fast ml-2013-01-14-Feature selection in practice

19 0.06176896 39 fast ml-2013-09-19-What you wanted to know about AUC

20 0.060739204 27 fast ml-2013-05-01-Deep learning made easy


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.272), (1, -0.039), (2, 0.076), (3, 0.072), (4, 0.155), (5, 0.372), (6, -0.302), (7, -0.362), (8, -0.185), (9, -0.05), (10, 0.132), (11, 0.175), (12, -0.005), (13, 0.026), (14, 0.178), (15, 0.172), (16, -0.033), (17, -0.05), (18, 0.128), (19, -0.019), (20, -0.095), (21, 0.038), (22, 0.089), (23, 0.087), (24, 0.094), (25, -0.003), (26, -0.028), (27, 0.024), (28, -0.066), (29, 0.069), (30, 0.049), (31, -0.049), (32, -0.04), (33, 0.067), (34, 0.062), (35, -0.014), (36, -0.035), (37, -0.103), (38, -0.031), (39, -0.034), (40, -0.024), (41, -0.016), (42, 0.003), (43, -0.026), (44, 0.033), (45, 0.06), (46, -0.018), (47, -0.011), (48, -0.047), (49, -0.065)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97159857 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data


2 0.83333844 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

Introduction: Last time we explored dimensionality reduction in practice using Gensim’s LSI and LDA. Now, having spent some time researching the subject matter, we will give an overview of other options. UPDATE : We now consider the topic quite irrelevant, because sparse high-dimensional data is precisely where linear models shine. See Amazon aspires to automate access control , Predicting advertised salaries and Predicting closed questions on Stack Overflow . And the few most popular methods are: LSI/LSA - a multinomial PCA LDA - Latent Dirichlet Allocation matrix factorization, in particular non-negative variants: NMF ICA, or Independent Components Analysis mixtures of Bernoullis stacked RBMs correlated topic models, an extension of LDA We tried the first two before. As regards matrix factorization, you do the same stuff as with movie recommendations (think Netflix challenge). The difference is, now all the matrix elements are known and we are only interested in

3 0.21341176 33 fast ml-2013-07-09-Introducing phraug

Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. With phraug you currently can convert from one format to another: csv to libsvm csv to Vowpal Wabbit libsvm to csv libsvm to Vowpal Wabbit tsv to csv And perform some other file operations: count lines in a file sample lines from a file split a file into two randomly split a file into a number of similarly sized chunks save a continuous subset of lines from a file (for example, first 100) delete specified columns from a csv file normalize (shift and scale) columns in a csv file Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily ach

4 0.1879447 42 fast ml-2013-10-28-How much data is enough?

Introduction: A Reddit reader asked how much data is needed for a machine learning project to get meaningful results. Prof. Yaser Abu-Mostafa from Caltech answered this very question in his online course. The answer is that as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model. In the case of a linear model, degrees of freedom essentially equal data dimensionality (the number of columns). We find that thinking in terms of dimensionality vs number of examples is a convenient shortcut. The more powerful the model, the more it’s prone to overfitting and so the more examples you need. And of course the way of controlling this is through validation. Breaking the rules In practice you can get away with less than 10x, especially if your model is simple and uses regularization. In Kaggle competitions the ratio is often closer to 1:1, and sometimes dimensionality is far greater than the number of examples, depending on how you pre-process the data

5 0.18513891 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

6 0.18443039 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

7 0.17311026 19 fast ml-2013-02-07-The secret of the big guys

8 0.16262656 20 fast ml-2013-02-18-Predicting advertised salaries

9 0.15859067 22 fast ml-2013-03-07-Choosing a machine learning algorithm

10 0.15777685 41 fast ml-2013-10-09-Big data made easy

11 0.15478057 32 fast ml-2013-07-05-Processing large files, line by line

12 0.15214542 30 fast ml-2013-06-01-Amazon aspires to automate access control

13 0.14610782 17 fast ml-2013-01-14-Feature selection in practice

14 0.14442581 16 fast ml-2013-01-12-Intro to random forests

15 0.14017029 18 fast ml-2013-01-17-A very fast denoising autoencoder

16 0.13766229 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

17 0.13423502 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

18 0.1333617 36 fast ml-2013-08-23-A bag of words and a nice little network

19 0.13291104 27 fast ml-2013-05-01-Deep learning made easy

20 0.1322889 39 fast ml-2013-09-19-What you wanted to know about AUC


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(6, 0.023), (22, 0.032), (26, 0.071), (31, 0.024), (55, 0.051), (58, 0.032), (65, 0.405), (69, 0.102), (71, 0.063), (78, 0.031), (79, 0.038), (81, 0.013), (99, 0.031)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.8544175 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data


2 0.71859848 34 fast ml-2013-07-14-Running things on a GPU

Introduction: You’ve heard about running things on a graphics card, but have you tried it? All you need to taste the speed is an Nvidia card and some software. We run experiments using Cudamat and Theano in Python. GPUs differ from CPUs in that they are optimized for throughput instead of latency. Here’s a metaphor: when you play an online game, you want fast response times, or low latency. Otherwise you get lag. However, when you’re downloading a movie, you don’t care about response times, you care about bandwidth - that’s throughput. Massive computations are similar to downloading a movie in this respect. The setup We’ll be testing things on a platform with an Intel Dual Core CPU @3Ghz and either GeForce 9600 GT, an old card, or GeForce GTX 550 Ti, a more recent card. See the appendix for more info on GPU hardware. Software we’ll be using is Python. On CPU it employs one core*. That is OK in everyday use because while one core is working hard, you can comfortably do something else becau

3 0.34322172 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

Introduction: Last time we explored dimensionality reduction in practice using Gensim’s LSI and LDA. Now, having spent some time researching the subject matter, we will give an overview of other options. UPDATE : We now consider the topic quite irrelevant, because sparse high-dimensional data is precisely where linear models shine. See Amazon aspires to automate access control , Predicting advertised salaries and Predicting closed questions on Stack Overflow . And the few most popular methods are: LSI/LSA - a multinomial PCA LDA - Latent Dirichlet Allocation matrix factorization, in particular non-negative variants: NMF ICA, or Independent Components Analysis mixtures of Bernoullis stacked RBMs correlated topic models, an extension of LDA We tried the first two before. As regards matrix factorization, you do the same stuff as with movie recommendations (think Netflix challenge). The difference is, now all the matrix elements are known and we are only interested in

4 0.31433058 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be a number of trees to grow and a number of candidate features at each split ( mtry in R randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

5 0.30324325 19 fast ml-2013-02-07-The secret of the big guys

Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit , a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google’s guy response) seems to be Sofia-ML . The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning ( PDF ). The paper, by Coates , Lee and Ng, is about object recogni

6 0.30003205 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

7 0.28740826 20 fast ml-2013-02-18-Predicting advertised salaries

8 0.28708547 16 fast ml-2013-01-12-Intro to random forests

9 0.28147742 18 fast ml-2013-01-17-A very fast denoising autoencoder

10 0.28099674 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

11 0.27931434 32 fast ml-2013-07-05-Processing large files, line by line

12 0.27750406 40 fast ml-2013-10-06-Pylearn2 in practice

13 0.27518049 25 fast ml-2013-04-10-Gender discrimination

14 0.27329138 13 fast ml-2012-12-27-Spearmint with a random forest

15 0.27248424 17 fast ml-2013-01-14-Feature selection in practice

16 0.26669511 27 fast ml-2013-05-01-Deep learning made easy

17 0.2665509 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

18 0.26538607 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

19 0.26524237 30 fast ml-2013-06-01-Amazon aspires to automate access control

20 0.2651704 43 fast ml-2013-11-02-Maxing out the digits