fast_ml fast_ml-2013 fast_ml-2013-33 knowledge-graph by maker-knowledge-mining

33 fast ml-2013-07-09-Introducing phraug


meta info for this blog

Source: html

Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning.

With phraug you currently can convert from one format to another:

csv to libsvm
csv to Vowpal Wabbit
libsvm to csv
libsvm to Vowpal Wabbit
tsv to csv

And perform some other file operations:

count lines in a file
sample lines from a file
split a file into two randomly
split a file into a number of similarly sized chunks
save a continuous subset of lines from a file (for example, the first 100)
delete specified columns from a csv file
normalize (shift and scale) columns in a csv file

Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily achieved using command line utilities.
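To give a flavour of what such a conversion involves, here is a minimal sketch of turning a csv file into Vowpal Wabbit format line by line. It is not the actual phraug script; the assumptions that the label sits in the first column and that the remaining columns hold numeric features are ours.

import csv
import sys

# a minimal csv -> Vowpal Wabbit sketch, not the actual phraug script;
# assumes the label is in the first column and the remaining columns
# hold numeric features
input_file = sys.argv[1]
output_file = sys.argv[2]

i_f = open( input_file )
o_f = open( output_file, 'w' )

reader = csv.reader( i_f )
for line in reader:
    label = line[0]
    # VW line format: label | name:value name:value ...
    features = ' '.join( '%s:%s' % ( i, v ) for i, v in enumerate( line[1:] ) )
    o_f.write( '%s | %s\n' % ( label, features ) )

i_f.close()
o_f.close()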


Summary: the most important sentences, as selected by the tfidf model

sentIndex sentText sentNum sentScore

1 Recently we proposed to pre-process large files line by line. [sent-1, score-0.488]

2 Now it’s time to introduce phraug *, a set of Python scripts based on this idea. [sent-2, score-0.584]

3 The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. [sent-3, score-0.649]

4 If you’re familiar with Unix, you may notice that some of these tasks are easily achieved using command line utilities. [sent-6, score-0.572]

5 For example, you can count lines with wc -l or see the beginning of a file with head. [sent-7, score-0.752]

6 Moreover, there are apps like sed and awk that allow for more complicated operations. [sent-8, score-0.084]

7 On Windows there are some tools (for example, more lets you preview large files), but generally such functionality is mostly lacking. [sent-9, score-0.368]

8 There’s a good option to remedy this: installing Cygwin, which provides all the important command line tools from Unix, including the bash shell. [sent-10, score-0.459]

9 Still, for things like format conversion, Python scripting is a good choice. [sent-11, score-0.237]

10 One reason is that you can easily modify those scripts to suit your needs. [sent-12, score-0.509]

11 We found that each project differs slightly in its pre-processing requirements, so we usually tweak an existing script or write a new one based on the basic template described in Processing large files, line by line; a minimal version of that template is sketched after this list. [sent-13, score-0.782]

12 The files and usage information are available on GitHub. [sent-14, score-0.125]
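Here is a minimal version of the basic template mentioned in sentence 11, using line counting (a Python stand-in for the wc -l of sentence 5) as the example operation; the actual phraug scripts may differ in detail.

import sys

# the basic template: open the input file, process it line by line,
# report the result; here the processing is just counting lines
input_file = sys.argv[1]

i_f = open( input_file )

n = 0
for line in i_f:
    n += 1

i_f.close()
print( n )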


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('csv', 0.397), ('file', 0.302), ('lines', 0.214), ('scripts', 0.209), ('libsvm', 0.192), ('phraug', 0.19), ('count', 0.152), ('line', 0.15), ('conversion', 0.139), ('tools', 0.129), ('files', 0.125), ('easily', 0.121), ('tasks', 0.121), ('large', 0.118), ('unix', 0.113), ('always', 0.101), ('based', 0.101), ('format', 0.098), ('command', 0.096), ('sized', 0.095), ('book', 0.095), ('shift', 0.095), ('tweak', 0.095), ('chunks', 0.095), ('similiarly', 0.095), ('cygwin', 0.095), ('proposed', 0.095), ('suit', 0.095), ('columns', 0.091), ('vowpal', 0.087), ('wabbit', 0.087), ('input', 0.087), ('apps', 0.084), ('normalize', 0.084), ('differs', 0.084), ('functionality', 0.084), ('introduce', 0.084), ('moreover', 0.084), ('beginning', 0.084), ('option', 0.084), ('operations', 0.084), ('preview', 0.084), ('specified', 0.084), ('modify', 0.084), ('continuous', 0.084), ('delete', 0.084), ('existing', 0.084), ('familiar', 0.084), ('mostly', 0.082), ('split', 0.082)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 33 fast ml-2013-07-09-Introducing phraug


2 0.2722753 32 fast ml-2013-07-05-Processing large files, line by line

Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole data into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code. The most common format for text files is probably CSV. For sparse data, libsvm format is popular. Both can be processed using the csv module in Python.

import csv
i_f = open( input_file, 'r' )
reader = csv.reader( i_f )

For libsvm you just set the delimiter to space:

reader = csv.reader( i_f, delimiter = ' ' )

Then you go over the file contents. Each line is a list of strings:

for line in reader:
    # do something with the line, for example:
    label = float( line[0] )
    # ....
    writer.writerow( line )

If you need to do a second pass, you just rewind the input file:

i_f.seek( 0 )
for line in re
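The quoted snippet is truncated and uses a writer it never defines; a self-contained version of the same pattern, with hypothetical file names, might look like this:

import csv
import sys

# read a CSV file line by line, inspect each row, and write it out;
# assumes no header row and a numeric label in the first column
input_file = sys.argv[1]
output_file = sys.argv[2]

i_f = open( input_file )
o_f = open( output_file, 'w' )

reader = csv.reader( i_f )
writer = csv.writer( o_f )

for line in reader:
    label = float( line[0] )
    writer.writerow( line )

# a second pass only requires rewinding the input file
i_f.seek( 0 )
for line in reader:
    pass    # do something else with each line

i_f.close()
o_f.close()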

3 0.17408623 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.
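The one-against-all construction mentioned above is easy to sketch; combining the predictions is where it gets tricky. A hedged outline, with a toy centroid "learner" standing in for a real one, purely for illustration:

# one-vs-all by hand: relabel the data once per class, train one binary
# model per class, then predict the class whose model scores highest
def relabel( labels, positive_class ):
    # 1 for the chosen class, -1 for all the others
    return [ 1 if y == positive_class else -1 for y in labels ]

def train( features, labels ):
    # toy stand-in for a real learner: the centroid of positive examples
    pos = [ f for f, y in zip( features, labels ) if y == 1 ]
    return [ sum( x[i] for x in pos ) / float( len( pos ) )
             for i in range( len( pos[0] ) ) ]

def score( model, x ):
    # dot product with the class centroid
    return sum( m * v for m, v in zip( model, x ) )

def one_vs_all( features, labels, classes ):
    models = dict( ( c, train( features, relabel( labels, c ) ) ) for c in classes )
    return lambda x: max( classes, key = lambda c: score( models[c], x ) )

predict = one_vs_all( [ [1, 0], [0, 1] ], [ 'a', 'b' ], [ 'a', 'b' ] )
print( predict( [1, 0] ) )    # 'a'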

4 0.17016833 20 fast ml-2013-02-18-Predicting advertised salaries

Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, while we will use all the words. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con

5 0.10777096 25 fast ml-2013-04-10-Gender discrimination

Introduction: There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not? Congratulations! You have spotted the ceiling cat. As Sashi noticed on the forums, it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story. Let’s start with normalizing. There’s a nice function for that in R, scale(). The dataset is small, 1128 examples, so we can go ahead and use R. It turns out that in its raw form, the data won’t scale. That’s because apparently there are some columns with zeros only. That makes it difficult to divide. Fortunately, we know just the right tool for the task. We learned about it on the Kaggle forums too. It’s a function in the caret package called nearZeroVar(). It will give you the indexes of all the columns which have near zero var
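scale() and nearZeroVar() are R functions; for a rough idea of what they do, here is a pure-Python sketch of the same logic - shift and scale each column, skipping columns whose variance is (near) zero so the division stays safe. The threshold and layout are our assumptions.

# shift and scale columns (like R's scale()), zeroing out columns whose
# standard deviation is near zero (the ones nearZeroVar() would flag)
def normalize_columns( rows, eps = 1e-9 ):
    n = len( rows )
    n_cols = len( rows[0] )
    means = [ sum( r[i] for r in rows ) / float( n ) for i in range( n_cols ) ]
    stds = [ ( sum( ( r[i] - means[i] ) ** 2 for r in rows ) / float( n ) ) ** 0.5
             for i in range( n_cols ) ]
    return [ [ ( r[i] - means[i] ) / stds[i] if stds[i] > eps else 0.0
               for i in range( n_cols ) ]
             for r in rows ]

print( normalize_columns( [ [1.0, 5.0], [3.0, 5.0] ] ) )
# [[-1.0, 0.0], [1.0, 0.0]] - the constant column is zeroed, not divided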

6 0.10523096 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

7 0.10272651 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

8 0.10172936 19 fast ml-2013-02-07-The secret of the big guys

9 0.094692826 40 fast ml-2013-10-06-Pylearn2 in practice

10 0.093895867 35 fast ml-2013-08-12-Accelerometer Biometric Competition

11 0.090818949 30 fast ml-2013-06-01-Amazon aspires to automate access control

12 0.089571521 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

13 0.089415044 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

14 0.080019966 17 fast ml-2013-01-14-Feature selection in practice

15 0.073311172 3 fast ml-2012-09-01-Running Unix apps on Windows

16 0.064801164 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

17 0.06304495 5 fast ml-2012-09-19-Best Buy mobile contest - big data

18 0.057503417 26 fast ml-2013-04-17-Regression as classification

19 0.054023739 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

20 0.053776111 8 fast ml-2012-10-15-Merck challenge


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.282), (1, -0.328), (2, -0.033), (3, 0.016), (4, -0.088), (5, 0.204), (6, -0.217), (7, 0.26), (8, -0.107), (9, -0.251), (10, -0.045), (11, -0.064), (12, 0.237), (13, 0.037), (14, 0.09), (15, -0.073), (16, -0.121), (17, 0.023), (18, -0.086), (19, -0.105), (20, 0.032), (21, -0.014), (22, -0.019), (23, 0.113), (24, -0.043), (25, 0.021), (26, -0.003), (27, 0.079), (28, -0.059), (29, -0.002), (30, -0.038), (31, 0.023), (32, 0.123), (33, -0.038), (34, 0.01), (35, 0.047), (36, 0.145), (37, -0.016), (38, -0.071), (39, -0.145), (40, 0.179), (41, 0.043), (42, 0.037), (43, -0.011), (44, 0.123), (45, 0.115), (46, -0.043), (47, -0.14), (48, -0.002), (49, -0.219)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99188989 33 fast ml-2013-07-09-Introducing phraug


2 0.73043478 32 fast ml-2013-07-05-Processing large files, line by line


3 0.37565711 20 fast ml-2013-02-18-Predicting advertised salaries


4 0.37468657 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow


5 0.27784702 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

Introduction: Much of the data in machine learning is sparse, that is, mostly zeros, and often binary. The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to a bag-of-words representation. If each feature is binary - either zero or one - then it holds exactly one bit of information. Surely we could somehow compress such data to fewer real numbers. To do this, we turn to topic models, an area of research with roots in natural language processing. In NLP, a training set is called a corpus, and each document is like a row in the set. A document might be three pages of text, or just a few words, as in a tweet. The idea of topic modelling is that you can group the words in your corpus into relatively few topics and represent each document as a mixture of these topics. It’s attractive because you can interpret the model by looking at the words that form the topics. Sometimes they seem meaningful, sometimes not. A meaningful topic might be, for example: “cric
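For concreteness, a tiny sketch of the two conversions mentioned above - a categorical value to a one-hot vector, and text to a bag-of-words count vector. Vocabulary handling is deliberately simplified; the function names are ours.

# categorical value -> one-hot vector over the known categories
def one_hot( value, categories ):
    return [ 1 if value == c else 0 for c in categories ]

# text -> bag-of-words counts over a fixed vocabulary
def bag_of_words( text, vocabulary ):
    words = text.lower().split()
    return [ words.count( v ) for v in vocabulary ]

print( one_hot( 'red', [ 'red', 'green', 'blue' ] ) )                  # [1, 0, 0]
print( bag_of_words( 'to be or not to be', [ 'be', 'to', 'cat' ] ) )   # [2, 2, 0]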

6 0.2398302 35 fast ml-2013-08-12-Accelerometer Biometric Competition

7 0.23580578 19 fast ml-2013-02-07-The secret of the big guys

8 0.22685245 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

9 0.19250223 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

10 0.18748863 25 fast ml-2013-04-10-Gender discrimination

11 0.1821906 40 fast ml-2013-10-06-Pylearn2 in practice

12 0.17650312 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

13 0.16064064 3 fast ml-2012-09-01-Running Unix apps on Windows

14 0.15556036 17 fast ml-2013-01-14-Feature selection in practice

15 0.15179822 30 fast ml-2013-06-01-Amazon aspires to automate access control

16 0.13527435 46 fast ml-2013-12-07-13 NIPS papers that caught our eye

17 0.12710933 36 fast ml-2013-08-23-A bag of words and a nice little network

18 0.11758147 56 fast ml-2014-03-31-If you use R, you may want RStudio

19 0.10970651 6 fast ml-2012-09-25-Best Buy mobile contest - full disclosure

20 0.10851131 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(31, 0.03), (55, 0.758), (69, 0.052), (71, 0.018), (81, 0.018), (99, 0.021)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98965716 33 fast ml-2013-07-09-Introducing phraug


2 0.56729871 32 fast ml-2013-07-05-Processing large files, line by line


3 0.31522691 20 fast ml-2013-02-18-Predicting advertised salaries


4 0.29185942 25 fast ml-2013-04-10-Gender discrimination


5 0.23473856 3 fast ml-2012-09-01-Running Unix apps on Windows

Introduction: When it comes to machine learning, most software seems to be in either Python, Matlab or R. Plus native apps, that is, compiled C/C++. These are the fastest. Most of them are written for Unix environments, for example Linux or MacOS. So how do you run them on your computer if you have Windows installed? Back in the day, you re-partitioned your hard drive and installed Linux alongside Windows. The added thrill was that if something went wrong, your computer wouldn’t boot. Now it’s easier. You just run Linux inside Windows, using what’s called a virtual machine. You need virtualization software and a machine image to do so. The most popular software seems to be VMware. There is also VirtualBox - it is able to run VMware images. We have experience mostly with VMware, so this is what we’ll refer to. VMware Player is free to download and use. There are also many images available, of various flavours of Linux and other operating systems. In fact, you can run Windows inside Linux if you wish

6 0.22658336 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

7 0.22193173 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

8 0.21697533 36 fast ml-2013-08-23-A bag of words and a nice little network

9 0.21444935 40 fast ml-2013-10-06-Pylearn2 in practice

10 0.20458488 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

11 0.20314416 51 fast ml-2014-01-25-Why IPy: reasons for using IPython interactively

12 0.19581056 35 fast ml-2013-08-12-Accelerometer Biometric Competition

13 0.19486484 19 fast ml-2013-02-07-The secret of the big guys

14 0.19073559 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

15 0.18339403 56 fast ml-2014-03-31-If you use R, you may want RStudio

16 0.18188186 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

17 0.17236727 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

18 0.16737549 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

19 0.16035374 17 fast ml-2013-01-14-Feature selection in practice

20 0.15842125 39 fast ml-2013-09-19-What you wanted to know about AUC