fast_ml fast_ml-2013 fast_ml-2013-30 knowledge-graph by maker-knowledge-mining

30 fast ml-2013-06-01-Amazon aspires to automate access control


meta info for this blog

Source: html

Introduction: This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily, and with an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit. Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs. 1,39353,85475,117961,118300,123472,117905,117906,290919,117908 1,17183,1540,117961,118343,123125,118536,118536,308574,118539 1,36724,14457,118219,118220,117884,117879,267952,19721,117880
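
The intro mentions heavily unbalanced classes. A minimal sketch of checking that imbalance, assuming the training data is a CSV file named train.csv with a header row and the 0/1 label in the first column (not the code from the post):

    import csv
    from collections import Counter

    counts = Counter()
    with open('train.csv') as f:
        reader = csv.reader(f)
        next(reader)              # skip the header row
        for row in reader:
            counts[row[0]] += 1   # first column holds the 0/1 label

    print(counts)                 # positives vastly outnumber negatives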


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This is about Amazon access control challenge at Kaggle. [sent-1, score-0.16]

2 With an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. [sent-5, score-0.207]

3 We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. [sent-6, score-0.438]

4 Just simple binary classification with the logistic loss function and 10 passes over the data. [sent-8, score-0.823]

5 It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives when compared to positives. [sent-9, score-0.216]

6 Apparently Amazon employees usually get the access they request, even though sometimes they are refused. [sent-10, score-0.178]

7 1,39353,85475,117961,118300,123472,117905,117906,290919,117908 1,17183,1540,117961,118343,123125,118536,118536,308574,118539 1,36724,14457,118219,118220,117884,117879,267952,19721,117880 We will count unique values in each column. [sent-13, score-0.183]

8 We’ll skip converting features to zeros and ones for now; Vowpal Wabbit doesn’t need this. [sent-18, score-0.277]

9 We prefix IDs with underscores so that VW knows that they are strings and need to be hashed. [sent-23, score-0.135]

10 Actually, you can skip prefixing and just use numbers, and it produces similar results. [sent-24, score-0.105]

11 We provide two Python scripts: one for converting from CSV to VW, and one for converting VW predictions to a submission format. [sent-25, score-0.344]

12 vw -c -k -f model --loss_function logistic --passes 10 vw -t -d test.txt [sent-35, score-1.082]

13 Normally we’d run VW’s predictions through a sigmoid function to get probabilities. [sent-40, score-0.129]

14 But you can do without this step here, because the AUC metric only cares about the ranking of predictions. [sent-41, score-0.229]

15 For the same reason you could use VW’s quantile loss function instead of the logistic. [sent-42, score-0.856]

16 VW loss functions Vowpal Wabbit supports four loss functions: squared, logistic, hinge and quantile. [sent-44, score-0.941]

17 Squared is for regression, logistic and hinge for classification, quantile for ranking. [sent-45, score-0.962]

18 This competition can be viewed either as a classification or a ranking task, so one might choose between logistic and quantile losses. [sent-46, score-1.051]

19 The downside of the quantile function is that you need to tune an additional parameter, --quantile_tau. [sent-47, score-0.646]

20 --quantile_tau 0.15 --passes 3 The results with logistic and quantile functions are similar. [sent-51, score-0.936]
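
Sentence 7 above mentions counting unique values in each column. A rough sketch of that count, under the same train.csv assumption and again not taken from the post:

    import csv
    from collections import defaultdict

    unique_values = defaultdict(set)
    with open('train.csv') as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            for name, value in zip(header, row):
                unique_values[name].add(value)

    for name in header:
        print(name, len(unique_values[name]))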
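
Sentences 8-11 describe converting the CSV to VW format and prefixing each ID with an underscore so that VW hashes it as a string. The post provides its own conversion scripts; the simplified sketch below only illustrates the idea, with placeholder file names, and remaps the 0/1 labels to -1/+1 as VW's logistic loss expects:

    import csv

    with open('train.csv') as f_in, open('train.vw', 'w') as f_out:
        reader = csv.reader(f_in)
        next(reader)                                  # skip the header row
        for row in reader:
            label = '1' if row[0] == '1' else '-1'    # 0/1 -> +1/-1 for logistic loss
            features = ' '.join('_' + value for value in row[1:])
            f_out.write('{} | {}\n'.format(label, features))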
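
Sentences 13-14 point out that VW's raw predictions could be passed through a sigmoid to obtain probabilities, and that skipping this step does not change AUC because the sigmoid is monotonic and AUC depends only on the ranking. A sketch of the optional step, assuming one raw score per line and placeholder file names:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    with open('predictions.txt') as f_in, open('probabilities.txt', 'w') as f_out:
        for line in f_in:
            score = float(line.split()[0])            # first token is the raw prediction
            f_out.write('{}\n'.format(sigmoid(score)))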
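
Sentences 16-17 list VW's four loss functions and their typical uses. The per-example formulas below are the textbook versions, shown for illustration with labels y in {-1, +1} for logistic and hinge; they are not copied from VW's source:

    import math

    def squared_loss(p, y):
        return (p - y) ** 2

    def logistic_loss(p, y):
        return math.log(1.0 + math.exp(-y * p))

    def hinge_loss(p, y):
        return max(0.0, 1.0 - y * p)

    def quantile_loss(p, y, tau=0.5):
        # pinball loss: tau weights under-prediction, (1 - tau) weights over-prediction
        return tau * (y - p) if y >= p else (1.0 - tau) * (p - y)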
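
Sentence 20 reports that the logistic and quantile runs give similar results. One way to verify such a claim offline, assuming a held-out set with known labels and one prediction file per run; the file names are placeholders and scikit-learn is used here for the AUC computation, which the post itself does not show:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.loadtxt('validation_labels.txt')           # one 0/1 label per line
    for path in ('preds_logistic.txt', 'preds_quantile.txt'):
        scores = np.loadtxt(path)                          # one raw score per line
        print(path, roc_auc_score(y_true, scores))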


similar blogs computed by the tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('quantile', 0.517), ('vw', 0.363), ('logistic', 0.273), ('loss', 0.21), ('hinge', 0.172), ('ranking', 0.172), ('converting', 0.172), ('functions', 0.146), ('squared', 0.143), ('wabbit', 0.13), ('function', 0.129), ('unique', 0.126), ('amazon', 0.126), ('skip', 0.105), ('auc', 0.103), ('vowpal', 0.098), ('fewer', 0.097), ('access', 0.097), ('classification', 0.089), ('model', 0.083), ('though', 0.081), ('column', 0.072), ('percent', 0.072), ('altogether', 0.072), ('handful', 0.072), ('action', 0.072), ('converges', 0.072), ('develop', 0.072), ('normally', 0.072), ('quick', 0.072), ('strings', 0.072), ('twenty', 0.072), ('binary', 0.065), ('length', 0.063), ('namespace', 0.063), ('control', 0.063), ('approximately', 0.063), ('comfortably', 0.063), ('knows', 0.063), ('negatives', 0.063), ('supports', 0.057), ('scored', 0.057), ('cares', 0.057), ('fourth', 0.057), ('passes', 0.057), ('count', 0.057), ('bunch', 0.053), ('ids', 0.053), ('rid', 0.053), ('originally', 0.053)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 30 fast ml-2013-06-01-Amazon aspires to automate access control

Introduction: This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily, and with an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit. Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs. 1,39353,85475,117961,118300,123472,117905,117906,290919,117908 1,17183,1540,117961,118343,123125,118536,118536,308574,118539 1,36724,14457,118219,118220,117884,117879,267952

2 0.31652451 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

3 0.22879194 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

Introduction: Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are: a neural network with a single hidden layer automatic creation of polynomial, specifically quadratic and cubic, features N-grams We describe how to use them, providing examples from the Kaggle Amazon competition and for the kin8nm dataset. Neural network The original motivation for creating neural network code in VW was to win some Kaggle competitions using only vee-dub , and that goal becomes much more feasible once you have a strong non-linear learner. The network seems to be a classic multi-layer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout. Here’s an example of how to create a network with 10 hidden units: vw -d data.vw --nn 10 Quadratic and cubic features The idea of quadratic features is to create all possible combinations between original features, so that

4 0.17526704 20 fast ml-2013-02-18-Predicting advertised salaries

Introduction: We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough? Congratulations! You have spotted the ceiling cat. A linear model better than a random forest - how so? Well, to train a random forest on data this big, the benchmark code extracts only the 100 most common words as features, and we will use all of them. This approach is similar to the one we applied in the Merck challenge. More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of the data (on your machine, anyway). The competition is about predicting salaries from job adverts. Of course the figures usually appear in the text, so they were removed. The error metric is mean absolute error (MAE) - how refreshing to see such an intuitive one. The data for Job salary prediction con

5 0.17275475 26 fast ml-2013-04-17-Regression as classification

Introduction: An interesting development occurred in Job salary prediction at Kaggle: the guy who ranked 3rd used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment. The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man, used 30 bins. We like a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples in the high end, we stop at 12.0, so that gives us 36 bins. Here’s the code: import numpy as np min_salary = 8.5 max_salary = 12.0 interval = 0.1 a_range = np.arange( min_salary, max_salary + interval, interval ) class_mapping = {} for i, n in enumerate( a_range ): n = round( n, 1 ) class_mapping[n] = i + 1 This way we get a mapping from log salaries to classes. Class labels start with 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be

6 0.13735346 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

7 0.11582497 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

8 0.090818949 33 fast ml-2013-07-09-Introducing phraug

9 0.081540711 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

10 0.081505194 39 fast ml-2013-09-19-What you wanted to know about AUC

11 0.076775931 40 fast ml-2013-10-06-Pylearn2 in practice

12 0.070440635 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

13 0.063207753 19 fast ml-2013-02-07-The secret of the big guys

14 0.057846189 25 fast ml-2013-04-10-Gender discrimination

15 0.05533918 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

16 0.054356877 8 fast ml-2012-10-15-Merck challenge

17 0.053041905 50 fast ml-2014-01-20-How to get predictions from Pylearn2

18 0.052834287 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

19 0.050405007 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

20 0.048324045 27 fast ml-2013-05-01-Deep learning made easy


similar blogs computed by the lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.296), (1, -0.403), (2, -0.041), (3, 0.249), (4, 0.039), (5, -0.069), (6, 0.224), (7, -0.177), (8, 0.1), (9, 0.11), (10, -0.022), (11, 0.018), (12, -0.039), (13, 0.048), (14, 0.083), (15, 0.042), (16, -0.076), (17, -0.045), (18, -0.137), (19, 0.08), (20, -0.055), (21, 0.03), (22, 0.027), (23, 0.095), (24, -0.02), (25, 0.094), (26, -0.053), (27, 0.037), (28, 0.079), (29, 0.154), (30, -0.173), (31, -0.096), (32, 0.139), (33, 0.013), (34, -0.04), (35, -0.06), (36, -0.165), (37, 0.122), (38, -0.145), (39, -0.125), (40, 0.08), (41, 0.006), (42, 0.011), (43, -0.026), (44, -0.08), (45, -0.085), (46, -0.038), (47, -0.088), (48, -0.191), (49, 0.14)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9839617 30 fast ml-2013-06-01-Amazon aspires to automate access control

Introduction: This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily, and with an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit. Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs. 1,39353,85475,117961,118300,123472,117905,117906,290919,117908 1,17183,1540,117961,118343,123125,118536,118536,308574,118539 1,36724,14457,118219,118220,117884,117879,267952

2 0.76362008 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

3 0.40329283 31 fast ml-2013-06-19-Go non-linear with Vowpal Wabbit

Introduction: Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are: a neural network with a single hidden layer automatic creation of polynomial, specifically quadratic and cubic, features N-grams We describe how to use them, providing examples from the Kaggle Amazon competition and for the kin8nm dataset. Neural network The original motivation for creating neural network code in VW was to win some Kaggle competitions using only vee-dub , and that goal becomes much more feasible once you have a strong non-linear learner. The network seems to be a classic multi-layer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout. Here’s an example of how to create a network with 10 hidden units: vw -d data.vw --nn 10 Quadratic and cubic features The idea of quadratic features is to create all possible combinations between original features, so that

4 0.3018612 29 fast ml-2013-05-25-More on sparse filtering and the Black Box competition

Introduction: The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot, used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525. We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava. To their credit, Charlie has won the two remaining Challenges in Representation Learning. Not-so-deep learning The difference to our previous, beating-the-benchmark attempt is twofold: one layer instead of two for supervised learning, and VW instead of a random forest. Somewhat surprisingly, one layer works better than two. Even more surprisingly, with enough units you can get 0.634 using a linear model (Vowpal Wabbit, of course, One-Against-All). In our understanding, that’s the point of overcomplete representations*, which Stanford people seem to care much about. Recall The secret of the big guys and the pape

5 0.2954345 26 fast ml-2013-04-17-Regression as classification

Introduction: An interesting development occurred in Job salary prediction at Kaggle: the guy who ranked 3rd used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment. The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man, used 30 bins. We like a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples in the high end, we stop at 12.0, so that gives us 36 bins. Here’s the code: import numpy as np min_salary = 8.5 max_salary = 12.0 interval = 0.1 a_range = np.arange( min_salary, max_salary + interval, interval ) class_mapping = {} for i, n in enumerate( a_range ): n = round( n, 1 ) class_mapping[n] = i + 1 This way we get a mapping from log salaries to classes. Class labels start with 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be

6 0.24121314 20 fast ml-2013-02-18-Predicting advertised salaries

7 0.19743355 39 fast ml-2013-09-19-What you wanted to know about AUC

8 0.18001878 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

9 0.16969174 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

10 0.15387006 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

11 0.1519897 25 fast ml-2013-04-10-Gender discrimination

12 0.14932965 52 fast ml-2014-02-02-Yesterday a kaggler, today a Kaggle master: a wrap-up of the cats and dogs competition

13 0.14283724 40 fast ml-2013-10-06-Pylearn2 in practice

14 0.13952014 19 fast ml-2013-02-07-The secret of the big guys

15 0.13336807 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

16 0.12687665 50 fast ml-2014-01-20-How to get predictions from Pylearn2

17 0.12203131 36 fast ml-2013-08-23-A bag of words and a nice little network

18 0.11947074 33 fast ml-2013-07-09-Introducing phraug

19 0.10556 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

20 0.10438017 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial


similar blogs computed by the lda model

lda for this blog:

topicId topicWeight

[(26, 0.616), (31, 0.053), (35, 0.015), (55, 0.029), (69, 0.095), (71, 0.025), (81, 0.02), (99, 0.039)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9706459 30 fast ml-2013-06-01-Amazon aspires to automate access control

Introduction: This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily, and with an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We scored fourth in our first attempt - the model was quick to develop and back then there were fewer competitors. Traditionally we use Vowpal Wabbit. Just simple binary classification with the logistic loss function and 10 passes over the data. It seems to work pretty well even though the classes are very unbalanced: there’s only a handful of negatives compared to positives. Apparently Amazon employees usually get the access they request, even though sometimes they are refused. Let’s look at the data. First a label and then a bunch of IDs. 1,39353,85475,117961,118300,123472,117905,117906,290919,117908 1,17183,1540,117961,118343,123125,118536,118536,308574,118539 1,36724,14457,118219,118220,117884,117879,267952

2 0.96736789 38 fast ml-2013-09-09-Predicting solar energy from weather forecasts plus a NetCDF4 tutorial

Introduction: Kaggle again. This time, solar energy prediction. We will show how to get data out of NetCDF4 files in Python and then beat the benchmark. The goal of this competition is to predict solar energy at Oklahoma Mesonet stations (red dots) from weather forecasts for GEFS points (blue dots): We’re getting a number of NetCDF files, each holding info about one variable, like expected temperature or precipitation. These variables have a few dimensions: time: the training set contains information about 5113 consecutive days. For each day there are five forecasts for different hours. location: location is described by latitude and longitude of GEFS points weather models: there are 11 forecasting models, called ensembles NetCDF4 tutorial The data is in NetCDF files, binary format apparently popular for storing scientific data. We will access it from Python using netcdf4-python . To use it, you will need to install HDF5 and NetCDF4 libraries first. If you’re on Wind

3 0.45160234 7 fast ml-2012-10-05-Predicting closed questions on Stack Overflow

Introduction: This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem. We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification. In case you’re wondering, Vowpal Wabbit is a fast linear learner. We like the “fast” part and “linear” is OK for dealing with lots of words, as in this contest. In any case, with more than three million data points it wouldn’t be that easy to train a kernel SVM, a neural net or what have you. VW, being a well-polished tool, has a few very convenient features.

4 0.3872731 21 fast ml-2013-02-27-Dimensionality reduction for sparse binary data

Introduction: Much of data in machine learning is sparse, that is mostly zeros, and often binary. The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to bag-of-words representation. If each feature is binary - either zero or one - then it holds exactly one bit of information. Surely we could somehow compress such data to fewer real numbers. To do this, we turn to topic models, an area of research with roots in natural language processing. In NLP, a training set is called a corpus, and each document is like a row in the set. A document might be three pages of text, or just a few words, as in a tweet. The idea of topic modelling is that you can group words in your corpus into relatively few topics and represent each document as a mixture of these topics. It’s attractive because you can interpret the model by looking at words that form the topics. Sometimes they seem meaningful, sometimes not. A meaningful topic might be, for example: “cric

5 0.37879187 23 fast ml-2013-03-18-Large scale L1 feature selection with Vowpal Wabbit

Introduction: The job salary prediction contest at Kaggle offers a high-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples. We present a way to select a few thousand relevant features using L1 (Lasso) regularization. A linear model seems to work just as well with those selected features as with the full set. This means we get roughly 40 times fewer features for a much more manageable, smaller data set. What you wanted to know about Lasso and Ridge L1 and L2 are both ways of regularization, sometimes called weight decay. Basically, we include parameter weights in a cost function. In effect, the model will try to minimize those weights by going “down the slope”. Example weights: in a linear model or in a neural network. L1 is known as Lasso and L2 is known as Ridge. These names may be confusing, because a chart of Lasso looks like a ridge and a

6 0.37030837 40 fast ml-2013-10-06-Pylearn2 in practice

7 0.36958039 26 fast ml-2013-04-17-Regression as classification

8 0.3558498 61 fast ml-2014-05-08-Impute missing values with Amelia

9 0.35397321 39 fast ml-2013-09-19-What you wanted to know about AUC

10 0.34469891 20 fast ml-2013-02-18-Predicting advertised salaries

11 0.33232161 60 fast ml-2014-04-30-Converting categorical data into numbers with Pandas and Scikit-learn

12 0.33018887 25 fast ml-2013-04-10-Gender discrimination

13 0.32648164 45 fast ml-2013-11-27-Object recognition in images with cuda-convnet

14 0.32538429 24 fast ml-2013-03-25-Dimensionality reduction for sparse binary data - an overview

15 0.30182686 34 fast ml-2013-07-14-Running things on a GPU

16 0.30064744 12 fast ml-2012-12-21-Tuning hyperparams automatically with Spearmint

17 0.29732022 36 fast ml-2013-08-23-A bag of words and a nice little network

18 0.29715249 17 fast ml-2013-01-14-Feature selection in practice

19 0.29426169 55 fast ml-2014-03-20-Good representations, distance, metric learning and supervised dimensionality reduction

20 0.28534618 18 fast ml-2013-01-17-A very fast denoising autoencoder