fast_ml fast_ml-2013 fast_ml-2013-32 knowledge-graph by maker-knowledge-mining

32 fast ml-2013-07-05-Processing large files, line by line


meta info for this blog

Source: html

Introduction: Perhaps the most common format of data for machine learning is text files. Often data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole dataset into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code.

The most common format for text files is probably CSV. For sparse data, the libsvm format is popular. Both can be processed using the csv module in Python:

import csv
i_f = open( input_file, 'r' )
reader = csv.reader( i_f )

For libsvm you just set the delimiter to space:

reader = csv.reader( i_f, delimiter = ' ' )

Then you go over the file contents. Each line is a list of strings:

for line in reader:
    # do something with the line, for example:
    label = float( line[0] )
    # ....
    writer.writerow( line )

If you need to do a second pass, you just rewind the input file:

i_f.seek( 0 )
for line in reader:
    ...
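
The code above is only a fragment of the post; a minimal, self-contained version of the same read-process-write pattern might look like the sketch below. The file names and the label-parsing step are placeholders, and newline='' is the Python 3 counterpart of the Python 2 file handling the 2013 post relies on, not something the post itself shows.

import csv

# hypothetical file names, for illustration only
input_file = 'data.csv'
output_file = 'data_processed.csv'

# newline='' avoids extra blank lines on Windows under Python 3;
# the post, written for Python 2, opens output files in binary mode instead
i_f = open( input_file, 'r' )
o_f = open( output_file, 'w', newline = '' )

reader = csv.reader( i_f )    # for libsvm-style files: csv.reader( i_f, delimiter = ' ' )
writer = csv.writer( o_f )

for line in reader:
    # each line is a list of strings; here we assume the first column holds a numeric label
    label = float( line[0] )
    writer.writerow( line )

# a second pass only needs the input file rewound
i_f.seek( 0 )
for line in reader:
    pass

i_f.close()
o_f.close()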


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 import csv i_f = open( input_file, 'r' ) reader = csv. [sent-10, score-0.437]

2 reader( i_f ) For libsvm you just set the delimiter to space: reader = csv. [sent-11, score-0.438]

3 Each line is a list of strings: for line in reader: # do something with the line, for example: label = float( line[0] ) # . [sent-13, score-0.69]

4 writerow( line ) If you need to do a second pass, you just rewind the input file: i_f. [sent-18, score-0.362]

5 Its purpose is to randomly split the file into two, such that some lines from the original file go to the first output file and the rest go to the second. [sent-23, score-1.037]

6 csv We’d also like to specify the probability of writing to the first file, so that for example 90% go to the train set and the rest to the test set: python split. [sent-29, score-0.473]

7 We need to import a few modules and read the file names: import csv import sys import random input_file = sys. [sent-35, score-0.902]

8 9 Random seed It might be useful to be able to split the file again in the future in exactly the same way. [sent-41, score-0.974]

9 To get the same split every time, we’ll seed a random number generator. [sent-45, score-0.773]

10 You give it a seed - an arbitrary string - and it will behave exactly the same every time. [sent-46, score-0.637]

11 So we’d like to be able to specify a seed on the command line as the final argument: python split. [sent-48, score-1.006]

12 argv[5] except IndexError: seed = None if seed: random. [sent-54, score-0.571]

13 seed( seed ) Readers and writers Let’s open the files and create a CSV reader and two writers. [sent-55, score-1.121]

14 If you’re on Windows, it’s important to open the files for writing in binary mode ( 'wb' ), otherwise you might get some extra new lines. [sent-56, score-0.524]

15 write( line ) The headers Some files have headers in the first line. [sent-62, score-0.856]

16 If that’s the case with your data, you can read the first line and write it to both output files: headers = reader. [sent-63, score-0.727]

17 We read a line and then (pseudo)randomly decide whether to write it to one file or the other. [sent-68, score-0.783]

18 We compare this number to P and on this basis decide which file to write to. [sent-72, score-0.452]

19 writerow( line ) Note that it’s an inexact method of splitting: if you have a thousand lines, you’ll get roughly 900 in the first file and roughly 100 in the second with P = 0. [sent-76, score-0.661]

20 An exact split If you wanted to split the file exactly 900/100, here’s a way to do it. [sent-78, score-0.587]
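
Sentences 5 through 19 above outline the post's split script: read the file names, a probability P and an optional random seed from the command line, open one CSV reader and two writers, copy the header line to both outputs, then route each remaining line to one of the two files at random. The script itself is not reproduced on this page, so the following is only a sketch of that idea; the argument order, the file names in the usage example and the Python 3 file handling are assumptions rather than the original split.py.

import csv
import random
import sys

# assumed argument order: input file, two output files, probability, optional seed
input_file = sys.argv[1]
output_file_1 = sys.argv[2]
output_file_2 = sys.argv[3]
P = float( sys.argv[4] )    # e.g. 0.9 sends roughly 90% of lines to the first file

# an optional seed makes the split reproducible: the same seed gives the same split every time
try:
    seed = sys.argv[5]
except IndexError:
    seed = None
if seed:
    random.seed( seed )

i_f = open( input_file, 'r' )
# the post opens output files in binary mode ( 'wb' ) to avoid extra newlines on Windows
# under Python 2; newline = '' is the Python 3 equivalent
o_f_1 = open( output_file_1, 'w', newline = '' )
o_f_2 = open( output_file_2, 'w', newline = '' )

reader = csv.reader( i_f )
writer_1 = csv.writer( o_f_1 )
writer_2 = csv.writer( o_f_2 )

# if the file has a header line, copy it to both output files
headers = next( reader )
writer_1.writerow( headers )
writer_2.writerow( headers )

# (pseudo)randomly decide, line by line, which file each line goes to
for line in reader:
    if random.random() < P:
        writer_1.writerow( line )
    else:
        writer_2.writerow( line )

i_f.close()
o_f_1.close()
o_f_2.close()

It could be run as, say, python split.py input.csv train.csv test.csv 0.9 some_seed (hypothetical file names). Because each line is routed independently, a thousand-line file splits into roughly 900 and 100 lines with P = 0.9, not exactly.

For the exact split mentioned in sentence 20, one way to do it, again a sketch rather than the post's code, is to pre-assign line indexes to the two files and shuffle that assignment; this requires knowing the number of lines up front. The loop below would replace the (pseudo)random loop above:

# exact split: shuffle line indexes, send the first n_first of them to the first file
n_lines = 1000    # assumed known, e.g. from a preliminary counting pass
n_first = 900

indexes = list( range( n_lines ) )
random.shuffle( indexes )
first_file_indexes = set( indexes[:n_first] )

for i, line in enumerate( reader ):
    if i in first_file_indexes:
        writer_1.writerow( line )
    else:
        writer_2.writerow( line )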


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('seed', 0.49), ('line', 0.289), ('reader', 0.285), ('open', 0.229), ('headers', 0.225), ('file', 0.191), ('split', 0.155), ('csv', 0.152), ('import', 0.152), ('indexerror', 0.122), ('specify', 0.122), ('files', 0.117), ('memory', 0.115), ('write', 0.11), ('read', 0.103), ('delimiter', 0.102), ('integers', 0.102), ('shuffle', 0.102), ('xrange', 0.102), ('indexes', 0.091), ('module', 0.09), ('decide', 0.09), ('float', 0.09), ('exactly', 0.086), ('except', 0.081), ('pass', 0.081), ('writing', 0.075), ('range', 0.075), ('second', 0.073), ('format', 0.07), ('random', 0.067), ('go', 0.067), ('really', 0.065), ('every', 0.061), ('compare', 0.061), ('randomly', 0.061), ('something', 0.058), ('rest', 0.057), ('lines', 0.057), ('list', 0.054), ('roughly', 0.054), ('final', 0.054), ('might', 0.052), ('otherwise', 0.051), ('command', 0.051), ('libsvm', 0.051), ('common', 0.051), ('processed', 0.051), ('readers', 0.051), ('formats', 0.051)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 32 fast ml-2013-07-05-Processing large files, line by line

2 0.2722753 33 fast ml-2013-07-09-Introducing phraug

Introduction: Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug *, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning. With phraug you can currently convert from one format to another: csv to libsvm csv to Vowpal Wabbit libsvm to csv libsvm to Vowpal Wabbit tsv to csv And perform some other file operations: count lines in a file sample lines from a file split a file into two randomly split a file into a number of similarly sized chunks save a continuous subset of lines from a file (for example, the first 100) delete specified columns from a csv file normalize (shift and scale) columns in a csv file Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged. If you’re familiar with Unix, you may notice that some of these tasks are easily ach

3 0.14601281 5 fast ml-2012-09-19-Best Buy mobile contest - big data

Introduction: Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in the “cloud computing” and “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark. But don’t be scared. Most of the data mass is in XML product information. Training and test sets together are 378MB. Good news. The really interesting thing is that we can take the script we’ve used for small data and apply it to this contest, obtaining 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can up the score to 0.55. Read below for details. This is the very same script, with one difference. In this challenge, benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before

4 0.133012 25 fast ml-2013-04-10-Gender discrimination

Introduction: There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not? Congratulations! You have spotted the ceiling cat. As Sashi noticed on the forums, it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story. Let’s start with normalizing. There’s a nice function for that in R, scale(). The dataset is small, 1128 examples, so we can go ahead and use R. Turns out that in its raw form, the data won’t scale. That’s because apparently there are some columns with zeros only. That makes it difficult to divide. Fortunately, we know just the right tool for the task. We learned about it on the Kaggle forums too. It’s a function in the caret package called nearZeroVar(). It will give you indexes of all the columns which have near zero var

5 0.126537 20 fast ml-2013-02-18-Predicting advertised salaries

Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use them all. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con

6 0.1139646 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

7 0.10634075 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

8 0.1050252 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

9 0.084932178 19 fast ml-2013-02-07-The secret of the big guys

10 0.081435159 43 fast ml-2013-11-02-Maxing out the digits

11 0.079984538 40 fast ml-2013-10-06-Pylearn2 in practice

12 0.077358507 36 fast ml-2013-08-23-A bag of words and a nice little network

13 0.075809844 50 fast ml-2014-01-20-How to get predictions from Pylearn2

14 0.074310973 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

15 0.073227532 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

16 0.066154458 17 fast ml-2013-01-14-Feature selection in practice

17 0.06519901 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

18 0.064269744 8 fast ml-2012-10-15-Merck challenge

19 0.060260996 26 fast ml-2013-04-17-Regression as classification

20 0.057261337 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.286), (1, -0.237), (2, -0.054), (3, -0.108), (4, -0.069), (5, 0.151), (6, -0.242), (7, 0.345), (8, -0.12), (9, -0.334), (10, -0.069), (11, -0.007), (12, 0.083), (13, -0.123), (14, -0.005), (15, -0.067), (16, -0.115), (17, 0.0), (18, -0.032), (19, 0.026), (20, 0.024), (21, 0.129), (22, -0.039), (23, 0.085), (24, 0.05), (25, -0.039), (26, 0.012), (27, -0.037), (28, -0.063), (29, 0.07), (30, -0.168), (31, 0.004), (32, -0.074), (33, -0.018), (34, 0.08), (35, -0.008), (36, 0.118), (37, 0.006), (38, 0.05), (39, 0.079), (40, 0.161), (41, 0.096), (42, 0.045), (43, 0.135), (44, 0.098), (45, -0.089), (46, -0.127), (47, 0.022), (48, 0.008), (49, 0.053)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98329264 32 fast ml-2013-07-05-Processing large files, line by line

2 0.72779876 33 fast ml-2013-07-09-Introducing phraug

3 0.42022592 5 fast ml-2012-09-19-Best Buy mobile contest - big data

4 0.31849331 20 fast ml-2013-02-18-Predicting advertised salaries

5 0.26904345 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

Introduction: The promise What’s attractive in machine learning? That a machine is learning, instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine, in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human): data preparation model tuning This story is about model tuning. Typically, to achieve satisfactory results, first we need to convert raw data into a format accepted by the model we would like to use, and then tune a few hyperparameters of the model. For example, some hyperparams to tune for a random forest may be the number of trees to grow and the number of candidate features at each split ( mtry in R's randomForest). For a neural network, there are quite a lot of hyperparams: number of layers, number of neurons in each layer (specifically, in each hid

6 0.26686984 25 fast ml-2013-04-10-Gender discrimination

7 0.23187162 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

8 0.21806423 36 fast ml-2013-08-23-A bag of words and a nice little network

9 0.21545543 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

10 0.20570666 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

11 0.18600222 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

12 0.17561388 50 fast ml-2014-01-20-How to get predictions from Pylearn2

13 0.16741987 40 fast ml-2013-10-06-Pylearn2 in practice

14 0.16183212 19 fast ml-2013-02-07-The secret of the big guys

15 0.16086699 43 fast ml-2013-11-02-Maxing out the digits

16 0.1489483 47 fast ml-2013-12-15-A-B testing with bayesian bandits in Google Analytics

17 0.14411791 22 fast ml-2013-03-07-Choosing a machine learning algorithm

18 0.13921434 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

19 0.13749447 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

20 0.13530117 27 fast ml-2013-05-01-Deep learning made easy


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(26, 0.016), (31, 0.02), (37, 0.027), (43, 0.375), (55, 0.151), (58, 0.015), (69, 0.151), (71, 0.037), (73, 0.013), (79, 0.022), (81, 0.014), (99, 0.051)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.85504359 32 fast ml-2013-07-05-Processing large files, line by line

2 0.41722554 33 fast ml-2013-07-09-Introducing phraug

3 0.40705952 25 fast ml-2013-04-10-Gender discrimination

4 0.3948229 20 fast ml-2013-02-18-Predicting advertised salaries

5 0.38599178 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

6 0.37786976 43 fast ml-2013-11-02-Maxing out the digits

7 0.36943775 27 fast ml-2013-05-01-Deep learning made easy

8 0.36435056 13 fast ml-2012-12-27-Spearmint with a random forest

9 0.35793561 17 fast ml-2013-01-14-Feature selection in practice

10 0.35582939 35 fast ml-2013-08-12-Accelerometer Biometric Competition

11 0.35547185 48 fast ml-2013-12-28-Regularizing neural networks with dropout and with DropConnect

12 0.35533482 14 fast ml-2013-01-04-Madelon: Spearmint's revenge

13 0.35439014 1 fast ml-2012-08-09-What you wanted to know about Mean Average Precision

14 0.35192388 16 fast ml-2013-01-12-Intro to random forests

15 0.34775823 40 fast ml-2013-10-06-Pylearn2 in practice

16 0.34751731 19 fast ml-2013-02-07-The secret of the big guys

17 0.34646934 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

18 0.34234756 18 fast ml-2013-01-17-A very fast denoising autoencoder

19 0.34170014 9 fast ml-2012-10-25-So you want to work for Facebook

20 0.3338238 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet