jmlr jmlr2013 jmlr2013-19 knowledge-graph by maker-knowledge-mining

19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations


Source: pdf

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. [sent-7, score-0.169]

2 We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. [sent-8, score-0.151]

3 Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox 1. [sent-9, score-0.15]

4 Kernel SVMs deliver state-of-the-art accuracies on non-linear problems, but are characterized by linear growth in the number of support vectors with data size, which may prevent learning from truly large data. [sent-11, score-0.036]

5 In contrast, linear SVMs cannot capture non-linear concepts, but are very scalable and allow learning from large data with limited resources. [sent-12, score-0.074]

6 Aimed at bridging the representability and scalability gap between linear and non-linear SVMs, recent advances in large-scale learning resulted in powerful algorithms that enable scalable training of non-linear SVMs, such as Adaptive Multi-hyperplane Machines (AMM) (Wang et al., 2011), Low-rank Linearization SVM (LLSVM), and Budgeted Stochastic Gradient Descent (BSGD). [sent-13, score-0.233]

7 With accuracies comparable to kernel SVM, the algorithms are scalable to millions of examples, having training and inference times comparable to linear SVM and orders of magnitude shorter than kernel SVM. [sent-17, score-0.461]

8 We present BudgetedSVM, an open-source C++ toolbox for scalable non-linear classification. [sent-18, score-0.224]

9 The toolbox provides an Application Programming Interface (API) for efficient training and testing of non-linear classifiers, supported by data structures designed for handling data which cannot fit in memory. [sent-19, score-0.264]

10 , 2008; Chang and Lin, 2011), combining the efficiency of linear with the accuracy of kernel SVM. [sent-21, score-0.078]

11 Wang et al. (2011) proposed a classifier that captures non-linearity by assigning a number of linear hyperplanes to each of the C classes from a set Y. [sent-28, score-0.049]

12 Given a D-dimensional example x, the AMM multiclass classifier has the following form: f(x) = arg max_{i∈Y} g(i, x), where g(i, x) = max_{j=1,...,b_i} w_{ij}^T x, (1) [sent-29, score-0.032]

13 where the ith class is assigned b_i weight vectors, with the total budget B = ∑_i b_i. [sent-32, score-0.12]
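To make the multiclass decision rule in (1) concrete, the sketch below scores a sparse example against per-class weight sets. This is a minimal illustration of the formula only; the container layout and function name are assumptions, not the toolbox's internal representation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sparse example: (feature index, value) pairs.
using SparseVec = std::vector<std::pair<size_t, double>>;

// weights[i] holds the b_i dense weight vectors assigned to class i.
// Returns arg max_i max_j w_{ij}^T x, as in Eq. (1).
int amm_predict(const std::vector<std::vector<std::vector<double>>>& weights,
                const SparseVec& x) {
    int best_class = -1;
    double best_score = -1e300;
    for (size_t i = 0; i < weights.size(); ++i) {
        for (const auto& w : weights[i]) {            // the b_i hyperplanes of class i
            double dot = 0.0;
            for (const auto& [idx, val] : x)          // sparse dot product w^T x
                dot += w[idx] * val;
            if (dot > best_score) { best_score = dot; best_class = static_cast<int>(i); }
        }
    }
    return best_class;
}
```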

14 The hyper-parameters include the regularization parameter λ, the number of training epochs e, the maximum number of non-zero weights per class B_lim (with b_i ≤ B_lim), and the weight pruning parameters k (pruning frequency) and c (pruning aggressiveness). [sent-34, score-0.192]

15 As an initial guideline for users, we experimentally found that for most data sets the values e = 5 (or e = 1 for very large data), B_lim = 50, k = 10,000, and c = 10 are appropriate choices, leaving only λ to be determined by cross-validation. [sent-35, score-0.03]
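These suggested defaults can be collected in a small configuration object, as sketched below. The struct name, field names, and the placeholder value for λ are illustrative assumptions; only the numeric defaults for e, B_lim, k, and c come from the text above.

```cpp
// Suggested AMM defaults from the text; lambda must still be cross-validated.
// Struct and field names are illustrative, not the toolbox's API.
struct AMMParams {
    double lambda  = 1e-5;    // regularization parameter (placeholder; tune by cross-validation)
    int    epochs  = 5;       // e; use 1 for very large data
    int    b_limit = 50;      // B_lim: max non-zero weight vectors per class
    int    prune_k = 10000;   // k: pruning frequency
    double prune_c = 10.0;    // c: pruning aggressiveness
};
```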

16 When b_1, . . . , b_C are fixed to 1, the AMM model reduces to linear multi-class SVM (Crammer and Singer, 2002), and the learning algorithm is equivalent to Pegasos, a popular linear SVM solver (Shalev-Shwartz et al. [sent-39, score-0.052]

17 As it is a widely-used linear SVM solver, we also provide the Pegasos algorithm directly as a shortcut in the BudgetedSVM toolbox. [sent-41, score-0.03]
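For reference, a single Pegasos step on a binary example with hinge loss looks roughly as follows. This is a textbook sketch of the update rule under the standard step size 1/(λt), not BudgetedSVM's implementation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One Pegasos step on sparse example (x, y), y in {-1, +1}, at iteration t.
void pegasos_step(std::vector<double>& w,
                  const std::vector<std::pair<size_t, double>>& x, int y,
                  double lambda, size_t t) {
    const double eta = 1.0 / (lambda * static_cast<double>(t));  // step size 1/(lambda * t)
    double margin = 0.0;
    for (const auto& [idx, val] : x) margin += w[idx] * val;     // w^T x
    margin *= y;
    for (double& wj : w) wj *= (1.0 - eta * lambda);             // shrink from the regularizer
    if (margin < 1.0)                                            // hinge loss is active
        for (const auto& [idx, val] : x) w[idx] += eta * y * val;
}
```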

18 Zhang et al. (2012) proposed to approximate the kernel SVM optimization by a linear SVM using a low-rank decomposition of the kernel matrix. [sent-44, score-0.156]

19 where z_i, i = 1, . . . , B, is a set of landmark points of size B, k(x, z_i) is a kernel function, w defines a separating hyperplane in the linearized kernel space (found using the DCD method), and M is a B × B mapping matrix. [sent-53, score-0.205]
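A minimal sketch of LLSVM-style prediction follows: the example is mapped to kernel values against the B landmarks, transformed by the B × B matrix M, and scored by the linear model w. The dense layout, the RBF kernel choice, and all names are illustrative assumptions rather than the toolbox's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// f(x) = w^T (M * g(x)),  with g(x)_i = k(x, z_i).  Illustrative dense sketch only.
double llsvm_predict(const std::vector<std::vector<double>>& landmarks, // B landmark points z_i
                     const std::vector<std::vector<double>>& M,         // B x B mapping matrix
                     const std::vector<double>& w,                      // linear model of length B
                     const std::vector<double>& x, double gamma) {
    const size_t B = landmarks.size();
    std::vector<double> g(B, 0.0), Mg(B, 0.0);
    for (size_t i = 0; i < B; ++i) {                    // g(x)_i = exp(-gamma * ||x - z_i||^2)
        double d2 = 0.0;
        for (size_t d = 0; d < x.size(); ++d) {
            const double diff = x[d] - landmarks[i][d];
            d2 += diff * diff;
        }
        g[i] = std::exp(-gamma * d2);
    }
    for (size_t i = 0; i < B; ++i)                      // Mg = M * g(x)
        for (size_t j = 0; j < B; ++j) Mg[i] += M[i][j] * g[j];
    double f = 0.0;
    for (size_t i = 0; i < B; ++i) f += w[i] * Mg[i];   // f(x) = w^T (M g(x))
    return f;
}
```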

20 The hyper-parameters include kernel parameters, regularization parameter λ, and the number of landmark points B. [sent-54, score-0.127]

21 Parameter B controls a trade-off between speed and accuracy, while kernel parameters and λ are best determined by cross-validation. [sent-55, score-0.078]

22 Wang et al. (2012) proposed a budgeted algorithm which maintains a fixed number of support vectors in the model, and incrementally updates them during the SGD training. [sent-58, score-0.246]

23 We implemented Pegasos-style training, where the budget is maintained through either merging (when the RBF kernel is used) or random removal of support vectors. [sent-63, score-0.14]
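The budget-maintenance step can be sketched as below for the random-removal case; merging, used with the RBF kernel, would instead combine two support vectors into one. The struct and function names are illustrative assumptions, not the toolbox's implementation.

```cpp
#include <random>
#include <utility>
#include <vector>

struct SupportVector {
    std::vector<double> x;   // the stored example
    double alpha;            // its coefficient in the model
};

// Keep at most `budget` support vectors by removing random ones (swap-and-pop).
void maintain_budget(std::vector<SupportVector>& svs, size_t budget, std::mt19937& rng) {
    while (svs.size() > budget) {
        std::uniform_int_distribution<size_t> pick(0, svs.size() - 1);
        const size_t victim = pick(rng);
        if (victim != svs.size() - 1) svs[victim] = std::move(svs.back());
        svs.pop_back();
    }
}
```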

24 The hyper-parameters include the number of epochs e, kernel parameters, regularization parameter λ, and budget size B. [sent-64, score-0.193]

25 Parameters B and e control a speed-accuracy trade-off, while kernel parameters and λ are best determined by cross-validation. [sent-65, score-0.078]

26 Parameter I for SVM with RBF kernel (RBF-SVM) denotes the number of training iterations, empirically shown to be super-linear in N (Chang and Lin, 2011). [sent-68, score-0.14]

27 The software package provides a C++ API, comprising functions for training and testing of non-linear models described in Section 2. [sent-74, score-0.103]

28 Each model can be easily trained and tested by calling the corresponding train/predict function, defined in mm algs. [sent-75, score-0.067]

29 The API also provides functions for handling large-scale, high-dimensional data, defined in budgetedSVM. [sent-79, score-0.052]

30 BudgetedSVM sequentially loads data chunks into memory to allow large-scale training, storing in memory only the indices and values of non-zero features as a linked list. [sent-81, score-0.288]
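A chunk-wise pass over a training file can be sketched as follows. The LibSVM-style line format (label followed by index:value pairs) is an assumption based on the LibSVM-compatible interfaces mentioned later, and the function and type names are illustrative.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One sparse example: a label plus (feature index, value) pairs.
struct Example { int label; std::vector<std::pair<int, float>> features; };

// Reads up to `chunk_size` lines of the form "label idx:val idx:val ..." into `chunk`.
// Returns false once the file is exhausted. Illustrative sketch of chunk-wise loading.
bool load_chunk(std::ifstream& in, size_t chunk_size, std::vector<Example>& chunk) {
    chunk.clear();
    std::string line;
    while (chunk.size() < chunk_size && std::getline(in, line)) {
        std::istringstream ls(line);
        Example ex;
        ls >> ex.label;
        std::string pair;
        while (ls >> pair) {                              // parse "index:value" tokens
            const size_t colon = pair.find(':');
            ex.features.emplace_back(std::stoi(pair.substr(0, colon)),
                                     std::stof(pair.substr(colon + 1)));
        }
        chunk.push_back(std::move(ex));
    }
    return !chunk.empty();
}
```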

31-33 Furthermore, the implementation of sparse vectors is optimized for high-dimensional data, allowing faster kernel computations and faster updates of hyperplanes and support vectors than the linked-list (e.g., as in LibSVM) or array (e.g., as in MSVMpack by Lauer and Guermeur, 2011) vector implementations used for regular-scale problems, where either time or memory costs can become prohibitively large during training in a large-scale setting. [sent-82, score-0.162] [sent-84, score-0.041] [sent-86, score-0.116]

34 In particular, vectors are split into disjoint chunks where pointers to each chunk are stored in an array, and memory for a chunk is allocated only if one of its elements is nonzero. [sent-87, score-0.276]
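A minimal version of such a lazily allocated, chunked vector is sketched below: an array of chunk pointers, with a chunk allocated only on the first nonzero write. The chunk size and class layout are assumptions for illustration, not the toolbox's actual data structure.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Vector split into disjoint fixed-size chunks; a chunk is allocated only when
// one of its elements becomes nonzero. Unallocated chunks read as zeros.
class ChunkedVector {
public:
    explicit ChunkedVector(size_t dim, size_t chunk_size = 1024)
        : chunk_size_(chunk_size), chunks_((dim + chunk_size - 1) / chunk_size) {}

    void set(size_t i, double v) {
        auto& chunk = chunks_[i / chunk_size_];
        if (!chunk) chunk = std::make_unique<std::vector<double>>(chunk_size_, 0.0);
        (*chunk)[i % chunk_size_] = v;
    }

    double get(size_t i) const {
        const auto& chunk = chunks_[i / chunk_size_];
        return chunk ? (*chunk)[i % chunk_size_] : 0.0;
    }

private:
    size_t chunk_size_;
    std::vector<std::unique_ptr<std::vector<double>>> chunks_;  // one pointer per chunk
};
```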

35 We empirically found that, while significantly reducing time costs, this approach incurs very limited memory overhead even for data with millions of features. [sent-88, score-0.133]

36 Moreover, by storing and incrementally updating support vector ℓ2-norms after each training step, the time to compute popular kernels (e.g., the RBF kernel, which depends on these norms) is further reduced. [sent-90, score-0.133]
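The speed-up rests on the identity ||x − z||² = ||x||² + ||z||² − 2 xᵀz: with squared norms cached, each RBF kernel evaluation only needs a sparse dot product. The sketch below assumes sparse vectors sorted by feature index; the function name and signature are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using SparseVec = std::vector<std::pair<size_t, double>>;  // sorted by feature index

// RBF kernel using cached squared norms: ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z,
// so only the sparse dot product has to be computed per kernel call.
double rbf_kernel(const SparseVec& x, double x_sq_norm,
                  const SparseVec& z, double z_sq_norm, double gamma) {
    double dot = 0.0;
    size_t i = 0, j = 0;                                   // merge-style walk over both vectors
    while (i < x.size() && j < z.size()) {
        if (x[i].first == z[j].first)     { dot += x[i].second * z[j].second; ++i; ++j; }
        else if (x[i].first < z[j].first) ++i;
        else                              ++j;
    }
    return std::exp(-gamma * (x_sq_norm + z_sq_norm - 2.0 * dot));
}
```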

37 Table 2 (excerpt): benchmark data sets webspam (N = 280,000, D = 254), rcv1 (N = 677,399, D = 47,236), and mnist8m-bin (N = 8,000,000, D = 784). [sent-111, score-0.069]

38 We also provide command-line and Matlab interfaces for easier use of the toolbox, which follow the user-friendly format of LibSVM and LibLinear. [sent-167, score-0.069]

39 The trained model is stored to the a9a model file. [sent-172, score-0.098]

40 The stored model file is then used to evaluate the trained model, which loads the testing data from the a9a test file. [sent-177, score-0.126]

41 Performance Comparison: The BudgetedSVM toolbox can learn an accurate model even for data with millions of examples and features, with training times orders of magnitude faster than RBF-SVM trained using LibSVM. [sent-183, score-0.358]

42 For illustration, in Table 2 we give a comparison of error rates and training times on binary classification tasks using several large-scale data sets (Wang et al. [sent-184, score-0.062]

43 On webspam and rcv1 it took LibSVM hours to train RBF-SVM, while BudgetedSVM algorithms with much smaller budgets achieved high accuracy within minutes, and even seconds in the case of AMM. [sent-186, score-0.158]

44 Similarly, RBF-SVM training on large-scale mnist8m-bin could not be completed in a reasonable time on our test machine, while the implemented algorithms were trained within a few hours on extremely limited budgets to achieve low error rates. [sent-187, score-0.218]

45 Using optimized functions and data structures as a basis, through our own and community efforts we plan to add more classifiers, such as Tighter Perceptron (Wang and Vucetic, 2009) and BPA (Wang and Vucetic, 2010), to make BudgetedSVM a more inclusive toolbox of budgeted SVM approximations. [sent-191, score-0.358]

46 Listed accuracy was obtained after 2 days of P-packSVM training on 512 processors (Zhu et al. [sent-194, score-0.062]

47 On the algorithmic implementation of multiclass kernel-based vector machines. [sent-214, score-0.032]

48 A dual coordinate descent method for large-scale linear SVM. [sent-226, score-0.091]

49 Tighter perceptron with improved dual use of cached data for model representation and validation. [sent-242, score-0.099]

50 Trading representability for scalability: Adaptive multi-hyperplane machine for nonlinear classification. [sent-254, score-0.069]

51 Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. [sent-260, score-0.138]

52 Scaling up kernel SVM on limited resources: A low-rank linearization approach. [sent-267, score-0.207]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('budgetedsvm', 0.589), ('amm', 0.312), ('bsgd', 0.208), ('budgeted', 0.208), ('vucetic', 0.173), ('svm', 0.173), ('toolbox', 0.15), ('pegasos', 0.134), ('wang', 0.122), ('temple', 0.107), ('blim', 0.104), ('djuric', 0.104), ('llsvm', 0.104), ('ncs', 0.104), ('nemanja', 0.104), ('linearization', 0.099), ('libsvm', 0.094), ('api', 0.092), ('crammer', 0.087), ('millions', 0.079), ('kernel', 0.078), ('scalable', 0.074), ('lan', 0.074), ('svms', 0.071), ('chunk', 0.069), ('dcd', 0.069), ('juric', 0.069), ('lauer', 0.069), ('msvmpack', 0.069), ('nsb', 0.069), ('representability', 0.069), ('slobodan', 0.069), ('ucetic', 0.069), ('udgeted', 0.069), ('webspam', 0.069), ('zhuang', 0.069), ('interfaces', 0.069), ('trained', 0.067), ('hsieh', 0.065), ('descent', 0.063), ('training', 0.062), ('budget', 0.062), ('budgets', 0.059), ('loads', 0.059), ('memory', 0.054), ('chunks', 0.053), ('epochs', 0.053), ('sgd', 0.053), ('handling', 0.052), ('solver', 0.052), ('chang', 0.049), ('liang', 0.049), ('hyperplanes', 0.049), ('ibm', 0.049), ('landmark', 0.049), ('pruning', 0.048), ('classi', 0.046), ('liblinear', 0.046), ('sv', 0.046), ('gradient', 0.042), ('perceptron', 0.041), ('comprising', 0.041), ('sb', 0.041), ('array', 0.041), ('zhu', 0.039), ('incrementally', 0.038), ('rbf', 0.038), ('er', 0.037), ('minutes', 0.036), ('db', 0.036), ('accuracies', 0.036), ('linked', 0.035), ('matlab', 0.034), ('adaptive', 0.033), ('stochastic', 0.033), ('storing', 0.033), ('le', 0.033), ('machines', 0.033), ('multiclass', 0.032), ('primal', 0.032), ('stored', 0.031), ('hours', 0.03), ('approximators', 0.03), ('lowrank', 0.03), ('aggressiveness', 0.03), ('cached', 0.03), ('developer', 0.03), ('developers', 0.03), ('guideline', 0.03), ('pike', 0.03), ('services', 0.03), ('shortcut', 0.03), ('complexities', 0.029), ('bi', 0.029), ('scalability', 0.028), ('ers', 0.028), ('lin', 0.028), ('dual', 0.028), ('comparable', 0.027), ('pa', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

2 0.083019242 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou

Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation

3 0.073936328 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

Author: David Picard, Nicolas Thome, Matthieu Cord

Abstract: JKernelMachines is a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. These types of kernels are often used in computer vision or bioinformatics applications. We provide several kernels leading to state of the art classification performances in computer vision, as well as various kernels on sets. The main focus of the library is to be easily extended with new kernels. Standard SVM optimization algorithms are available, but also more sophisticated learning-based kernel combination methods such as Multiple Kernel Learning (MKL), and a recently published algorithm to learn powered products of similarities (Product Kernel Learning). Keywords: classification, support vector machines, kernel, computer vision

4 0.042397588 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoy strong theoretical guarantees that are comparable or better than SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression

5 0.039661285 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: In this paper we establish that the Lovász ϑ function on a graph can be restated as a kernel learning problem. We introduce the notion of SVM−ϑ graphs, on which the Lovász ϑ function can be approximated well by a Support vector machine (SVM). We show that Erdős-Rényi random G(n, p) graphs are SVM−ϑ graphs for log⁴n/n ≤ p < 1. Even if we embed a large clique of size Θ(√(np/(1−p))) in a G(n, p) graph the resultant graph still remains a SVM−ϑ graph. This immediately suggests an SVM based algorithm for recovering a large planted clique in random graphs. Associated with the ϑ function is the notion of orthogonal labellings. We introduce common orthogonal labellings which extends the idea of orthogonal labellings to multiple graphs. This allows us to propose a Multiple Kernel learning (MKL) based solution which is capable of identifying a large common dense subgraph in multiple graphs. Both in the planted clique case and common subgraph detection problem the proposed solutions beat the state of the art by an order of magnitude. Keywords: orthogonal labellings of graphs, planted cliques, random graphs, common dense subgraph

6 0.037604619 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

7 0.035872612 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

8 0.035386525 83 jmlr-2013-Orange: Data Mining Toolbox in Python

9 0.035105638 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

10 0.032630954 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

11 0.031940468 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

12 0.029379066 108 jmlr-2013-Stochastic Variational Inference

13 0.029267021 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

14 0.028672257 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

15 0.028550303 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

16 0.027533108 8 jmlr-2013-A Theory of Multiclass Boosting

17 0.025855616 22 jmlr-2013-Classifying With Confidence From Incomplete Information

18 0.025163297 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

19 0.024756551 73 jmlr-2013-Multicategory Large-Margin Unified Machines

20 0.024391877 82 jmlr-2013-Optimally Fuzzy Temporal Memory


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.112), (1, 0.004), (2, -0.056), (3, -0.025), (4, 0.027), (5, 0.034), (6, 0.021), (7, 0.022), (8, -0.048), (9, -0.208), (10, 0.159), (11, -0.15), (12, 0.171), (13, -0.008), (14, 0.048), (15, -0.096), (16, 0.126), (17, -0.011), (18, 0.063), (19, -0.055), (20, 0.026), (21, -0.052), (22, -0.064), (23, 0.116), (24, 0.103), (25, -0.102), (26, -0.074), (27, 0.211), (28, 0.123), (29, 0.072), (30, -0.038), (31, 0.106), (32, -0.141), (33, 0.145), (34, 0.066), (35, 0.134), (36, 0.091), (37, -0.056), (38, -0.009), (39, 0.006), (40, 0.024), (41, 0.049), (42, 0.091), (43, -0.009), (44, -0.034), (45, -0.063), (46, 0.089), (47, 0.011), (48, -0.008), (49, -0.165)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94872653 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

2 0.62419659 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou

Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation

3 0.50308466 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

Author: David Picard, Nicolas Thome, Matthieu Cord

Abstract: JKernelMachines is a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. These types of kernels are often used in computer vision or bioinformatics applications. We provide several kernels leading to state of the art classification performances in computer vision, as well as various kernels on sets. The main focus of the library is to be easily extended with new kernels. Standard SVM optimization algorithms are available, but also more sophisticated learning-based kernel combination methods such as Multiple Kernel Learning (MKL), and a recently published algorithm to learn powered products of similarities (Product Kernel Learning). Keywords: classification, support vector machines, kernel, computer vision

4 0.44905707 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoy strong theoretical guarantees that are comparable or better than SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression

5 0.37891179 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco

Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number of state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra

6 0.31692597 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

7 0.29967186 83 jmlr-2013-Orange: Data Mining Toolbox in Python

8 0.26600957 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

9 0.25960681 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

10 0.25821391 82 jmlr-2013-Optimally Fuzzy Temporal Memory

11 0.25319523 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

12 0.22610059 22 jmlr-2013-Classifying With Confidence From Incomplete Information

13 0.22051001 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

14 0.20137139 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

15 0.1881935 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

16 0.18682881 73 jmlr-2013-Multicategory Large-Margin Unified Machines

17 0.17633089 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

18 0.16616586 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.16537231 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

20 0.15801679 106 jmlr-2013-Stationary-Sparse Causality Network Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (5, 0.119), (6, 0.032), (10, 0.044), (20, 0.014), (44, 0.026), (58, 0.505), (68, 0.018), (75, 0.043), (85, 0.014), (87, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67224413 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

2 0.26165745 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.26151687 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

Author: Alberto N. Escalante-B., Laurenz Wiskott

Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the final label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classification, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufficient as well as when feature selection is difficult. Keywords: slow feature analysis, feature extraction, classification, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis

4 0.26106688 15 jmlr-2013-Bayesian Canonical Correlation Analysis

Author: Arto Klami, Seppo Virtanen, Samuel Kaski

Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation

5 0.25909194 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

6 0.25858641 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

7 0.25835368 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

8 0.25754571 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

9 0.25735271 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

10 0.25719711 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions

11 0.25712264 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

12 0.25673202 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

13 0.2565763 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

14 0.25630805 73 jmlr-2013-Multicategory Large-Margin Unified Machines

15 0.2559931 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

16 0.25580561 114 jmlr-2013-The Rate of Convergence of AdaBoost

17 0.25520462 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

18 0.254895 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

19 0.25486866 120 jmlr-2013-Variational Algorithms for Marginal MAP

20 0.25435364 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression