fast_ml fast_ml-2013 fast_ml-2013-41 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or ‘software as a service’.
sentIndex sentText sentNum sentScore
1 This post was inspired by a very good article about big data by Chris Stucchio (linked below). [sent-2, score-0.589]
2 Big data is hype Everybody talks about big data; nobody knows exactly what it is. [sent-5, score-0.9]
3 Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). [sent-7, score-0.195]
4 Now, to put things in context: Big data is right there (or maybe not quite yet?) [sent-8, score-0.2]
5 In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. [sent-11, score-0.922]
6 What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. [sent-13, score-1.052]
7 Before cloud computing it was data warehousing or ‘software as a service’. [sent-14, score-0.349]
8 There’s a new buzzword every two years and the computer analysts come out with these things so that they will have something to consult about. [sent-15, score-0.223]
9 Also see “Most data isn’t big, and businesses are wasting money pretending it is” and a paper from Microsoft: “Nobody ever got fired for buying a cluster”. [sent-18, score-0.595]
10 Another way to say it: big data is like teenage sex… You already know this meme, don’t you? [sent-19, score-0.473]
11 Big data is technical difficulty Big data can be defined in terms of the technical difficulty it causes. [sent-20, score-0.773]
12 For example, when deciding if your data is big, you could draw a line on whether it fits comfortably into memory (a minimal check is sketched after this sentence list). [sent-21, score-0.332]
13 The point is, as data grows larger, it becomes more difficult to process. [sent-23, score-0.216]
14 The author is talking about MapReduce: The only reason to put on this straightjacket is that by doing so, you can scale up to extremely large data sets. [sent-25, score-0.407]
15 Big data is effective If it’s hype and a source of difficulties, why bother? [sent-30, score-0.454]
16 In machine learning particularly, more examples are usually better, especially when data dimensionality is high. [sent-35, score-0.193]
17 Which brings us to the last point… Big data is spying Consider a task the big guys like Google, Facebook etc. [sent-40, score-0.473]
18 are dealing with: they have visitor data from hundreds of thousands or millions of sites. [sent-41, score-0.204]
19 Apple published a transparency report, notable for its, let’s say, tagline: our business does not depend on collecting personal data. [sent-52, score-0.334]
20 For a company not interested in personal data, they’re pretty nosy, aren’t they? [sent-54, score-0.273]
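Sentence 12 above draws the line for “big” at whether the data fits comfortably into memory. Below is a minimal sketch of that check in Python; the file path, the use of the third-party psutil package, and the 0.5 “comfort factor” are illustrative assumptions, not anything prescribed by the post.

```python
import os
import psutil  # third-party package (pip install psutil); assumed here for the RAM check

def fits_comfortably(path, comfort_factor=0.5):
    """Return True if the file is smaller than a fraction of currently available RAM.

    comfort_factor is an arbitrary safety margin: parsing and processing the data
    usually needs noticeably more memory than the raw file size.
    """
    file_size = os.path.getsize(path)              # bytes on disk
    available = psutil.virtual_memory().available  # bytes of RAM free right now
    return file_size < comfort_factor * available

# Hypothetical usage:
# print(fits_comfortably("train.csv"))
```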
wordName wordTfidf (topN-words)
[('big', 0.346), ('corpus', 0.244), ('apple', 0.22), ('hype', 0.22), ('cloud', 0.161), ('analysts', 0.146), ('brown', 0.146), ('nobody', 0.146), ('nsa', 0.146), ('personal', 0.146), ('sites', 0.146), ('straightjacket', 0.146), ('stucchio', 0.146), ('term', 0.134), ('google', 0.132), ('data', 0.127), ('ram', 0.122), ('chris', 0.122), ('difficulty', 0.122), ('article', 0.116), ('errors', 0.107), ('comfortably', 0.107), ('technical', 0.107), ('effective', 0.107), ('web', 0.107), ('norvig', 0.097), ('larger', 0.089), ('whether', 0.083), ('computer', 0.077), ('dealing', 0.077), ('put', 0.073), ('examples', 0.066), ('computing', 0.061), ('got', 0.061), ('fits', 0.061), ('money', 0.061), ('draw', 0.061), ('quiet', 0.061), ('extremely', 0.061), ('overview', 0.061), ('talks', 0.061), ('aren', 0.061), ('linked', 0.061), ('media', 0.061), ('mining', 0.061), ('social', 0.061), ('announced', 0.061), ('business', 0.061), ('coming', 0.061), ('defined', 0.061)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000005 41 fast ml-2013-10-09-Big data made easy
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or ‘software as a service’.
2 0.13892294 37 fast ml-2013-09-03-Our followers and who else they follow
Introduction: Recently we hit the 400 followers mark on Twitter. To celebrate we decided to do some data mining on you, specifically to discover who our followers are and who else they follow. For your viewing pleasure we packaged the results nicely with Bootstrap. Here’s some data science in action. Our followers This table shows our 20 most popular followers as measured by their follower count. The occasional question marks stand for non-ASCII characters. Each link opens a new window. Followers Screen name Name Description 8685 pankaj Pankaj Gupta I lead the Personalization and Recommender Systems group at Twitter. Founded two startups in the past. 5070 ogrisel Olivier Grisel Datageek, contributor to scikit-learn, works with Python / Java / Clojure / Pig, interested in Machine Learning, NLProc, {Big|Linked|Open} Data and braaains! 4582 thuske thuske & 4442 ram Ram Ravichandran So
3 0.099074133 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit . We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning, including denoising auto-encoders, contracting auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v
4 0.093386173 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics
Introduction: A/B testing is a way to optimize a web page. Half of visitors see one version, the other half another, so you can tell which version is more conducive to your goal - for example selling something. Since June 2013 A/B testing can be conveniently done with Google Analytics. Here’s how. This article is not quite about machine learning. If you’re not interested in testing, scroll down to the bayesian bandits section. Google Content Experiments We remember Google Website Optimizer from a few years ago. It wasn’t exactly user friendly or slick, but it felt solid and did the job. Unfortunately, at one point in time Google pulled the plug, leaving Genetify as the sole free (and open source) tool for multivariate testing. Multivariate means testing a few elements on a page simultaneously. At that time they launched Content Experiments in Google Analytics, but it was a giant step backward. Content experiments were very primitive and only allowed rudimentary A/B split testing. (A minimal bayesian-bandit sketch follows this list.) It i
5 0.090366691 19 fast ml-2013-02-07-The secret of the big guys
Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recognition
6 0.087565042 5 fast ml-2012-09-19-Best Buy mobile contest - big data
7 0.085809767 15 fast ml-2013-01-07-Machine learning courses online
8 0.073488533 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
9 0.068714879 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
10 0.068446122 27 fast ml-2013-05-01-Deep learning made easy
11 0.065416977 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
12 0.056780953 42 fast ml-2013-10-28-How much data is enough?
13 0.054281265 20 fast ml-2013-02-18-Predicting advertised salaries
14 0.052937668 57 fast ml-2014-04-01-Exclusive Geoff Hinton interview
15 0.05271196 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
16 0.052089911 32 fast ml-2013-07-05-Processing large files, line by line
17 0.051545501 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
18 0.049779098 2 fast ml-2012-08-27-Kaggle job recommendation challenge
19 0.049383242 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
20 0.046895109 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
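Entry 4 in the list above (A-B testing with bayesian bandits in Google Analytics) mentions bayesian bandits without spelling them out. The sketch below shows one common bandit scheme, Thompson sampling with Beta posteriors over two page variants; the counts and variant names are invented for illustration, and this is not claimed to be Google Analytics’ exact algorithm.

```python
import random

# Observed outcomes so far for two page variants (numbers are made up).
stats = {"A": {"wins": 12, "losses": 488},
         "B": {"wins": 20, "losses": 480}}

def choose_variant():
    """Thompson sampling: sample each variant's Beta posterior and show the best draw."""
    draws = {name: random.betavariate(s["wins"] + 1, s["losses"] + 1)
             for name, s in stats.items()}
    return max(draws, key=draws.get)

def record(variant, converted):
    """Update the chosen variant's posterior after observing the visitor's outcome."""
    stats[variant]["wins" if converted else "losses"] += 1

# Per visitor: v = choose_variant(); serve version v; later call record(v, converted).
```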
topicId topicWeight
[(0, 0.215), (1, 0.049), (2, 0.151), (3, -0.118), (4, 0.124), (5, 0.12), (6, -0.04), (7, 0.13), (8, 0.272), (9, 0.027), (10, 0.008), (11, 0.202), (12, -0.062), (13, -0.126), (14, 0.107), (15, -0.015), (16, -0.006), (17, 0.034), (18, 0.202), (19, 0.16), (20, -0.137), (21, -0.174), (22, -0.324), (23, 0.077), (24, -0.014), (25, 0.12), (26, -0.075), (27, -0.225), (28, 0.171), (29, -0.059), (30, -0.129), (31, -0.079), (32, -0.076), (33, 0.158), (34, -0.034), (35, 0.163), (36, -0.046), (37, 0.01), (38, -0.148), (39, -0.056), (40, -0.021), (41, -0.253), (42, 0.363), (43, -0.053), (44, 0.036), (45, 0.102), (46, 0.059), (47, 0.027), (48, 0.026), (49, -0.131)]
simIndex simValue blogId blogTitle
same-blog 1 0.98026085 41 fast ml-2013-10-09-Big data made easy
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or ‘software as a service’.
2 0.36318693 37 fast ml-2013-09-03-Our followers and who else they follow
Introduction: Recently we hit the 400 followers mark on Twitter. To celebrate we decided to do some data mining on you, specifically to discover who our followers are and who else they follow. For your viewing pleasure we packaged the results nicely with Bootstrap. Here’s some data science in action. Our followers This table shows our 20 most popular followers as measured by their follower count. The occasional question marks stand for non-ASCII characters. Each link opens a new window. Followers Screen name Name Description 8685 pankaj Pankaj Gupta I lead the Personalization and Recommender Systems group at Twitter. Founded two startups in the past. 5070 ogrisel Olivier Grisel Datageek, contributor to scikit-learn, works with Python / Java / Clojure / Pig, interested in Machine Learning, NLProc, {Big|Linked|Open} Data and braaains! 4582 thuske thuske & 4442 ram Ram Ravichandran So
3 0.20904961 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
Introduction: On May 15th Yann LeCun answered “ask me anything” questions on Reddit . We hand-picked some of his thoughts and grouped them by topic for your enjoyment. Toronto, Montreal and New York All three groups are strong and complementary. Geoff (who spends more time at Google than in Toronto now) and Russ Salakhutdinov like RBMs and deep Boltzmann machines. I like the idea of Boltzmann machines (it’s a beautifully simple concept) but it doesn’t scale well. Also, I totally hate sampling. Yoshua and his colleagues have focused a lot on various unsupervised learning, including denoising auto-encoders, contracting auto-encoders. They are not allergic to sampling like I am. On the application side, they have worked on text, not so much on images. In our lab at NYU (Rob Fergus, David Sontag, me and our students and postdocs), we have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well. We have also worked on applications, mostly to v
4 0.2015253 19 fast ml-2013-02-07-The secret of the big guys
Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. (A minimal sketch of this mapping follows the list below.) Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recognition
5 0.18042299 5 fast ml-2012-09-19-Best Buy mobile contest - big data
Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants managed to beat the benchmark. But don’t be scared. Most of the data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before
6 0.16719963 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
7 0.15322889 20 fast ml-2013-02-18-Predicting advertised salaries
8 0.14178911 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics
9 0.13989198 59 fast ml-2014-04-21-Predicting happiness from demographics and poll answers
10 0.13874939 2 fast ml-2012-08-27-Kaggle job recommendation challenge
11 0.13741782 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
12 0.13390027 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition
13 0.1318763 27 fast ml-2013-05-01-Deep learning made easy
14 0.12595797 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
15 0.121056 40 fast ml-2013-10-06-Pylearn2 in practice
16 0.11975475 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview
17 0.11971775 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial
18 0.1195166 42 fast ml-2013-10-28-How much data is enough?
19 0.11668549 15 fast ml-2013-01-07-Machine learning courses online
20 0.11421274 25 fast ml-2013-04-10-Gender discrimination
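Entry 4 in the list above (The secret of the big guys) describes learning K-means cluster centers, mapping the data onto a new feature space based on those centers, and training a linear model on top. Below is a rough scikit-learn sketch of that combination on toy data; the dataset, the number of clusters and the classifier settings are placeholders, and the original paper uses a more elaborate encoding than raw cluster distances.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data standing in for a real problem.
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KMeans.transform() maps each example to its distances to the k learned cluster
# centers, i.e. a new k-dimensional feature space; a linear model then learns on it.
model = make_pipeline(
    KMeans(n_clusters=100, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```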
topicId topicWeight
[(26, 0.032), (31, 0.052), (35, 0.034), (55, 0.016), (58, 0.022), (69, 0.098), (71, 0.018), (73, 0.015), (78, 0.03), (81, 0.012), (84, 0.034), (92, 0.485), (99, 0.072)]
simIndex simValue blogId blogTitle
same-blog 1 0.88875175 41 fast ml-2013-10-09-Big data made easy
Introduction: An overview of key points about big data. This post was inspired by a very good article about big data by Chris Stucchio (linked below). The article is about hype and technology. We hate the hype. Big data is hype Everybody talks about big data; nobody knows exactly what it is. That’s pretty much the definition of hype. Google Trends suggests that the term took off at the beginning of 2011 (and the searches are coming mainly from Asia, curiously). Now, to put things in context: Big data is right there (or maybe not quite yet?) with other slogans like web 2.0, cloud computing and social media. In effect, big data is a generic term for: data science, machine learning, data mining, predictive analytics and so on. Don’t believe us? What about James Goodnight, the CEO of SAS: The term big data is being used today because computer analysts and journalists got tired of writing about cloud computing. Before cloud computing it was data warehousing or ‘software as a service’.
2 0.24872841 19 fast ml-2013-02-07-The secret of the big guys
Introduction: Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster. We have already written about Vowpal Wabbit, a fast linear learner from Yahoo/Microsoft. Google’s response (or at least, a Google guy’s response) seems to be Sofia-ML. The software consists of two parts: a linear learner and K-means clustering. We found Sofia a while ago and wondered about K-means: who needs K-means? Here’s a clue: This package can be used for learning cluster centers (…) and for mapping a given data set onto a new feature space based on the learned cluster centers. Our eyes only opened when we read a certain paper, namely An Analysis of Single-Layer Networks in Unsupervised Feature Learning (PDF). The paper, by Coates, Lee and Ng, is about object recognition
3 0.24214117 9 fast ml-2012-10-25-So you want to work for Facebook
Introduction: Good news, everyone! There’s a new contest on Kaggle - Facebook is looking for talent. They won’t pay, but just might interview. This post is in a way a bonus for active readers because most visitors of fastml.com originally come from Kaggle forums. For this competition the forums are disabled to encourage own work. To honor this, we won’t publish any code. But own work doesn’t mean original work, and we wouldn’t want to reinvent the wheel, would we? The contest differs substantially from a Kaggle stereotype, if there is such a thing, in three major ways: there are no money prizes, as mentioned above; it’s not a real world problem, but rather an assignment to screen job candidates (this has important consequences, described below); and it’s not a typical machine learning project, but rather a broader AI exercise. You are given a graph of the internet, actually a snapshot of the graph for each of 15 time steps. You are also given a bunch of paths in this graph, which a
4 0.23838973 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
Introduction: We continue with the CIFAR-10-based competition at Kaggle to get to know DropConnect. It’s supposed to be an improvement over dropout. And dropout is certainly one of the bigger steps forward in neural network development. Is DropConnect really better than dropout? TL;DR DropConnect seems to offer results similar to dropout. State of the art scores reported in the paper come from model ensembling. Dropout Dropout, by Hinton et al., is perhaps the biggest invention in the field of neural networks in recent years. It addresses the main problem in machine learning, that is, overfitting. It does so by “dropping out” some unit activations in a given layer, that is, setting them to zero. Thus it prevents co-adaptation of units and can also be seen as a method of ensembling many networks sharing the same weights. For each training example a different set of units to drop is randomly chosen (a minimal masking sketch follows the list below). The idea has a biological inspiration. When a child is conceived, it receives half its genes f
5 0.23540984 16 fast ml-2013-01-12-Intro to random forests
Introduction: Let’s step back from forays into cutting-edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive? First of all, decision tree ensembles have been found by Caruana et al. to be the best overall approach for a variety of problems. Random forests, specifically, perform well both in low dimensional and high dimensional tasks. There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting). Random forest is an example of the bagging approach, less prone to overfitting. Gradient boosted trees (notably the GBM package in R) represent the other one. Both are very successful in many applications. Trees are also relatively fast to train, compared to some more involved methods. Besides effectiveness
6 0.2251019 27 fast ml-2013-05-01-Deep learning made easy
7 0.22497991 18 fast ml-2013-01-17-A very fast denoising autoencoder
8 0.22157489 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
9 0.22127399 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
10 0.22114109 54 fast ml-2014-03-06-PyBrain - a simple neural networks library in Python
11 0.21993391 43 fast ml-2013-11-02-Maxing out the digits
12 0.21906021 40 fast ml-2013-10-06-Pylearn2 in practice
13 0.21681106 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction
14 0.21661294 13 fast ml-2012-12-27-Spearmint with a random forest
15 0.21612206 4 fast ml-2012-09-17-Best Buy mobile contest
16 0.21470615 17 fast ml-2013-01-14-Feature selection in practice
17 0.21447825 61 fast ml-2014-05-08-Impute missing values with Amelia
18 0.21331151 25 fast ml-2013-04-10-Gender discrimination
19 0.2132885 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
20 0.2125206 62 fast ml-2014-05-26-Yann LeCun's answers from the Reddit AMA
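Entry 4 in the list above (Regularizing neural networks with dropout and with DropConnect) describes dropout as setting a randomly chosen subset of unit activations in a layer to zero, with a different subset per training example. Below is a minimal numpy sketch of that masking step; the layer values and drop probability are illustrative, and the “inverted” rescaling shown is one common variant rather than the exact formulation in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Zero out each activation independently with probability p_drop.

    Uses "inverted dropout": surviving activations are scaled by 1 / (1 - p_drop)
    so that no rescaling is needed at test time.
    """
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop  # True = keep this unit
    return activations * mask / (1.0 - p_drop)

# One hidden layer's activations for a batch of 4 examples (made-up numbers):
h = rng.normal(size=(4, 8))
print(dropout(h))
```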