How much data is enough?

fast ml, 2013-10-28
A Reddit reader asked how much data is needed for a machine learning project to get meaningful results. Prof. Yaser Abu-Mostafa from Caltech answered this very question in his online course. The answer, as a rule of thumb: you need roughly 10 times as many examples as there are degrees of freedom in your model. In the case of a linear model, degrees of freedom essentially equal the data dimensionality (the number of columns). We find that thinking in terms of dimensionality vs. the number of examples is a convenient shortcut. The more powerful the model, the more prone it is to overfitting, and so the more examples you need. And of course the way to keep this in check is validation.
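As a back-of-the-envelope illustration, here is what the rule of thumb and the validation check amount to in code. This is a toy sketch using scikit-learn (our choice of tool, not something the course prescribes); the helper name and the numbers are made up for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rough_sample_size(n_columns, factor=10):
    """Rule of thumb: ~10 examples per degree of freedom, and for
    a linear model degrees of freedom ~ the number of columns."""
    return factor * n_columns

print(rough_sample_size(50))  # a 50-column dataset calls for ~500 examples

# Validation is what actually tells you whether you have enough data:
# decent cross-validated scores suggest the model isn't just overfitting.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())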
Breaking the rules

In practice you can get away with less than 10x, especially if your model is simple and uses regularization. In Kaggle competitions the ratio is often closer to 1:1, and sometimes the dimensionality is far greater than the number of examples, depending on how you pre-process the data.
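A minimal sketch of that 1:1 regime, again with scikit-learn on synthetic data (the numbers are ours, purely illustrative): with regularization, a linear model can still validate respectably even when the number of examples equals the number of columns.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 300 examples and 300 features: a 1:1 ratio, far from the 10x rule.
X, y = make_classification(n_samples=300, n_features=300,
                           n_informative=20, random_state=0)

# C is the inverse regularization strength; a small C shrinks the weights hard.
model = LogisticRegression(C=0.1, max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())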
Text is the typical case: represented as a bag of words, it may be very high-dimensional and very sparse. For instance, consider the Online Learning Library experiments on a binary version of the News20 dataset. With 15k training points and well over a million features, you can get 96% accuracy (and not because the classes are skewed; they are perfectly balanced).
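A roughly equivalent experiment is easy to sketch with scikit-learn instead of the Online Learning Library the post refers to. Assumptions here: you have downloaded the LIBSVM news20.binary file yourself, the local path is hypothetical, and the exact accuracy will depend on the split and the learner.

from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# news20.binary: ~20k documents, ~1.35M bag-of-words features, extremely sparse.
X, y = load_svmlight_file("news20.binary")  # hypothetical local path

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=15000, random_state=0)

# An online linear learner with logistic loss, working on the sparse matrix directly.
clf = SGDClassifier(loss="log_loss", alpha=1e-6, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # the post reports ~96% with a comparable setup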