fast_ml fast_ml-2013 fast_ml-2013-33 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. With phraug you currently can convert from one format to another: csv to libsvm, csv to Vowpal Wabbit, libsvm to csv, libsvm to Vowpal Wabbit, and tsv to csv. You can also perform some other file operations: count lines in a file, sample lines from a file, split a file into two randomly, split a file into a number of similarly sized chunks, save a continuous subset of lines from a file (for example, the first 100), delete specified columns from a csv file, and normalize (shift and scale) columns in a csv file. Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily achieved using command line utilities.
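To give a flavour of the kind of conversion involved, here is a minimal line-by-line csv-to-VW sketch - not phraug’s actual code, which lives on GitHub - assuming a numeric label in the first column and purely numeric features:

import csv

def csv_to_vw(input_file, output_file):
    # Minimal sketch of a csv -> Vowpal Wabbit conversion, done line by line
    # so the whole file never needs to fit in memory. Assumes the label is
    # in the first column and all remaining columns are numeric.
    i_f = open(input_file)
    o_f = open(output_file, 'w')
    for line in csv.reader(i_f):
        label = line[0]
        features = ' '.join('%d:%s' % (i, x) for i, x in enumerate(line[1:]))
        o_f.write('%s | %s\n' % (label, features))
    i_f.close()
    o_f.close()

csv_to_vw('train.csv', 'train.vw')   # hypothetical file names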
sentIndex sentText sentNum sentScore
1 Recently we proposed to pre-process large files line by line. [sent-1, score-0.488]
2 Now it’s time to introduce phraug *, a set of Python scripts based on this idea. [sent-2, score-0.584]
3 The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. [sent-3, score-0.649]
4 If you’re familiar with Unix, you may notice that some of these tasks are easily achieved using command line utilities. [sent-6, score-0.572]
5 For example, you can count lines with wc -l or see the beginning of a file with head. [sent-7, score-0.752]
6 Moreover, there are apps like sed and awk that allow for more complicated operations. [sent-8, score-0.084]
7 On Windows there are some tools - for example, with more you can preview large files - but generally such functionality is mostly lacking. [sent-9, score-0.368]
8 There’s a good option to remedy this: installing Cygwin, which provides all the important command line tools from Unix, including the bash shell. [sent-10, score-0.459]
9 Still, for things like format conversion, Python scripting is a good choice. [sent-11, score-0.237]
10 One reason is that you can easily modify those scripts to suit your needs. [sent-12, score-0.509]
11 We found that each project differs slightly in its pre-processing requirements, so we usually tweak an existing script or write a new one based on the basic template described in Processing large files, line by line (a minimal sketch follows the list below). [sent-13, score-0.782]
12 The files and usage information are available on GitHub. [sent-14, score-0.125]
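Sentence 11 refers to the basic template from Processing large files, line by line. A minimal sketch of that idea, with hypothetical file names (the canonical version is in the post itself):

i_f = open('input.csv')        # hypothetical input path
o_f = open('output.csv', 'w')  # hypothetical output path

for line in i_f:
    # transform the line here, then write it out;
    # only one line is held in memory at a time
    o_f.write(line)

i_f.close()
o_f.close()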
wordName wordTfidf (topN-words)
[('csv', 0.397), ('file', 0.302), ('lines', 0.214), ('scripts', 0.209), ('libsvm', 0.192), ('phraug', 0.19), ('count', 0.152), ('line', 0.15), ('conversion', 0.139), ('tools', 0.129), ('files', 0.125), ('easily', 0.121), ('tasks', 0.121), ('large', 0.118), ('unix', 0.113), ('always', 0.101), ('based', 0.101), ('format', 0.098), ('command', 0.096), ('sized', 0.095), ('book', 0.095), ('shift', 0.095), ('tweak', 0.095), ('chunks', 0.095), ('similiarly', 0.095), ('cygwin', 0.095), ('proposed', 0.095), ('suit', 0.095), ('columns', 0.091), ('vowpal', 0.087), ('wabbit', 0.087), ('input', 0.087), ('apps', 0.084), ('normalize', 0.084), ('differs', 0.084), ('functionality', 0.084), ('introduce', 0.084), ('moreover', 0.084), ('beginning', 0.084), ('option', 0.084), ('operations', 0.084), ('preview', 0.084), ('specified', 0.084), ('modify', 0.084), ('continuous', 0.084), ('delete', 0.084), ('existing', 0.084), ('familiar', 0.084), ('mostly', 0.082), ('split', 0.082)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 33 fast ml-2013-07-09-Introducing phraug
2 0.2722753 32 fast ml-2013-07-05-Processing large files, line by line
Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, the libsvm format is popular. Both can be processed using the csv module in Python: import csv i_f = open( input_file, 'r' ) reader = csv.reader( i_f ) For libsvm you just set the delimiter to space: reader = csv.reader( i_f, delimiter = ' ' ) Then you go over the file contents. Each line is a list of strings: for line in reader: # do something with the line, for example: label = float( line[0] ) # .... writer.writerow( line ) If you need to do a second pass, you just rewind the input file: i_f.seek( 0 ) for line in reader:
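The code in this excerpt is flattened and references a writer it never defines. Collected into one runnable sketch - hypothetical file names, writer setup added by us:

import csv

i_f = open('input.csv')               # hypothetical input path
o_f = open('output.csv', 'w')         # hypothetical output path
reader = csv.reader(i_f)
writer = csv.writer(o_f, lineterminator='\n')

for line in reader:
    # each line is a list of strings
    label = float(line[0])            # e.g. a label in the first column
    writer.writerow(line)

# a second pass: rewind the input file and iterate again
i_f.seek(0)
for line in reader:
    pass

i_f.close()
o_f.close()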
3 0.17408623 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.
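The one-vs-all construction mentioned in the excerpt can also be done line by line. A sketch under the assumption of an integer class label 0..4 in the first csv column - not the code actually used for the contest:

import csv

def one_vs_all(input_file, n_classes=5):
    # Write one copy of the data per class, with the label binarized
    # to "this class vs. all the others".
    handles = [open('class_%d.csv' % c, 'w') for c in range(n_classes)]
    writers = [csv.writer(h, lineterminator='\n') for h in handles]
    i_f = open(input_file)
    for line in csv.reader(i_f):
        label = int(line[0])
        for c, writer in enumerate(writers):
            writer.writerow([1 if label == c else 0] + line[1:])
    i_f.close()
    for h in handles:
        h.close()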
4 0.17016833 20 fast ml-2013-02-18-Predicting advertised salaries
Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and, of course, beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use all of them. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con
5 0.10777096 25 fast ml-2013-04-10-Gender discrimination
Introduction: There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not? Congratulations! You have spotted the ceiling cat. As Sashi noticed on the forums, it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story. Let’s start with normalizing. There’s a nice function for that in R, scale() . The dataset is small, 1128 examples, so we can go ahead and use R. It turns out that in its raw form, the data won’t scale. That’s because apparently there are some columns with zeros only. That makes it difficult to divide. Fortunately, we know just the right tool for the task. We learned about it on Kaggle forums too. It’s a function in the caret package called nearZeroVar() . It will give you the indexes of all the columns which have near zero variance
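For Python users, the same two steps - dropping near-zero-variance columns, then shifting and scaling - might look like this with numpy. A sketch of the idea only, since caret’s nearZeroVar() uses a more careful heuristic than a plain variance threshold:

import numpy as np

def drop_near_zero_var(X, eps=1e-8):
    # keep only columns whose variance exceeds a small threshold
    return X[:, X.var(axis=0) > eps]

def scale(X):
    # shift to zero mean and scale to unit standard deviation, like R's scale()
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.random.rand(1128, 10)   # stand-in for the real data
X[:, 3] = 0.0                  # a zeros-only column, as in the post
X = scale(drop_near_zero_var(X))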
6 0.10523096 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
7 0.10272651 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
8 0.10172936 19 fast ml-2013-02-07-The secret of the big guys
9 0.094692826 40 fast ml-2013-10-06-Pylearn2 in practice
10 0.093895867 35 fast ml-2013-08-12-Accelerometer Biometric Competition
11 0.090818949 30 fast ml-2013-06-01-Amazon aspires to automate access control
12 0.089571521 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
13 0.089415044 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
14 0.080019966 17 fast ml-2013-01-14-Feature selection in practice
15 0.073311172 3 fast ml-2012-09-01-Running Unix apps on Windows
16 0.064801164 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
17 0.06304495 5 fast ml-2012-09-19-Best Buy mobile contest - big data
18 0.057503417 26 fast ml-2013-04-17-Regression as classification
19 0.054023739 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
20 0.053776111 8 fast ml-2012-10-15-Merck challenge
topicId topicWeight
[(0, 0.282), (1, -0.328), (2, -0.033), (3, 0.016), (4, -0.088), (5, 0.204), (6, -0.217), (7, 0.26), (8, -0.107), (9, -0.251), (10, -0.045), (11, -0.064), (12, 0.237), (13, 0.037), (14, 0.09), (15, -0.073), (16, -0.121), (17, 0.023), (18, -0.086), (19, -0.105), (20, 0.032), (21, -0.014), (22, -0.019), (23, 0.113), (24, -0.043), (25, 0.021), (26, -0.003), (27, 0.079), (28, -0.059), (29, -0.002), (30, -0.038), (31, 0.023), (32, 0.123), (33, -0.038), (34, 0.01), (35, 0.047), (36, 0.145), (37, -0.016), (38, -0.071), (39, -0.145), (40, 0.179), (41, 0.043), (42, 0.037), (43, -0.011), (44, 0.123), (45, 0.115), (46, -0.043), (47, -0.14), (48, -0.002), (49, -0.219)]
simIndex simValue blogId blogTitle
same-blog 1 0.99188989 33 fast ml-2013-07-09-Introducing phraug
2 0.73043478 32 fast ml-2013-07-05-Processing large files, line by line
3 0.37565711 20 fast ml-2013-02-18-Predicting advertised salaries
4 0.37468657 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
5 0.27784702 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
Introduction: Much of the data in machine learning is sparse, that is, mostly zeros, and often binary. The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to a bag-of-words representation. If each feature is binary - either zero or one - then it holds exactly one bit of information. Surely we could somehow compress such data to fewer real numbers. To do this, we turn to topic models, an area of research with roots in natural language processing. In NLP, a training set is called a corpus, and each document is like a row in the set. A document might be three pages of text, or just a few words, as in a tweet. The idea of topic modelling is that you can group words in your corpus into relatively few topics and represent each document as a mixture of these topics. It’s attractive because you can interpret the model by looking at words that form the topics. Sometimes they seem meaningful, sometimes not. A meaningful topic might be, for example: “cric
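As a tiny illustration of the sparse binary representation the excerpt starts from, a binary bag-of-words over a toy corpus - our example, not the post’s:

# one column per vocabulary word, 1 if the word occurs in the document
docs = ['the cat sat', 'the dog sat', 'cat and dog']
vocab = sorted(set(w for d in docs for w in d.split()))
rows = [[1 if w in d.split() else 0 for w in vocab] for d in docs]
# rows is now a dense view of a mostly-zero binary matrix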
6 0.2398302 35 fast ml-2013-08-12-Accelerometer Biometric Competition
7 0.23580578 19 fast ml-2013-02-07-The secret of the big guys
8 0.22685245 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
9 0.19250223 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
10 0.18748863 25 fast ml-2013-04-10-Gender discrimination
11 0.1821906 40 fast ml-2013-10-06-Pylearn2 in practice
12 0.17650312 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
13 0.16064064 3 fast ml-2012-09-01-Running Unix apps on Windows
14 0.15556036 17 fast ml-2013-01-14-Feature selection in practice
15 0.15179822 30 fast ml-2013-06-01-Amazon aspires to automate access control
16 0.13527435 46 fast ml-2013-12-07-13 NIPS papers that caught our eye
17 0.12710933 36 fast ml-2013-08-23-A bag of words and a nice little network
18 0.11758147 56 fast ml-2014-03-31-If you use R, you may want RStudio
19 0.10970651 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure
20 0.10851131 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
topicId topicWeight
[(31, 0.03), (55, 0.758), (69, 0.052), (71, 0.018), (81, 0.018), (99, 0.021)]
simIndex simValue blogId blogTitle
same-blog 1 0.98965716 33 fast ml-2013-07-09-Introducing phraug
2 0.56729871 32 fast ml-2013-07-05-Processing large files, line by line
3 0.31522691 20 fast ml-2013-02-18-Predicting advertised salaries
4 0.29185942 25 fast ml-2013-04-10-Gender discrimination
5 0.23473856 3 fast ml-2012-09-01-Running Unix apps on Windows
Introduction: When it comes to machine learning, most software seems to be in either Python, Matlab or R. Plus native apps, that is, compiled C/C++. These are the fastest. Most of them are written for Unix environments, for example Linux or MacOS. So how do you run them on your computer if you have Windows installed? Back in the day, you re-partitioned your hard drive and installed Linux alongside Windows. The added thrill was that if something went wrong, your computer wouldn’t boot. Now it’s easier. You just run Linux inside Windows, using what’s called a virtual machine. You need virtualization software and a machine image to do so. The most popular software seems to be VMware. There is also VirtualBox - it is able to run VMware images. We have experience with VMware mostly, so this is what we’ll refer to. VMware Player is free to download and use. There are also many images available, of various flavours of Linux and other operating systems. In fact, you can run Windows inside Linux if you wish.
6 0.22658336 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit
7 0.22193173 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data
8 0.21697533 36 fast ml-2013-08-23-A bag of words and a nice little network
9 0.21444935 40 fast ml-2013-10-06-Pylearn2 in practice
10 0.20458488 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow
11 0.20314416 51 fast ml-2014-01-25-Why IPy: reasons for using IPython interactively
12 0.19581056 35 fast ml-2013-08-12-Accelerometer Biometric Competition
13 0.19486484 19 fast ml-2013-02-07-The secret of the big guys
14 0.19073559 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn
15 0.18339403 56 fast ml-2014-03-31-If you use R, you may want RStudio
16 0.18188186 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint
17 0.17236727 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet
18 0.16737549 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect
19 0.16035374 17 fast ml-2013-01-14-Feature selection in practice
20 0.15842125 39 fast ml-2013-09-19-What you wanted to know about AUC