jmlr jmlr2010 jmlr2010-110 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, and Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
Reference: text
sentIndex sentText sentNum sentScore
1 Journal of Machine Learning Research 11 (2010) 1799-1802 Submitted 10/09; Revised 3/10; Published 6/10 The SHOGUN Machine Learning Toolbox Sören Sonnenburg∗ SOEREN . [sent-1, score-0.044]
2 28/29, 10587 Berlin, Germany Gunnar Rätsch Sebastian Henschel Christian Widmer Jonas Behr Alexander Zien† Fabio de Bona GUNNAR . [sent-4, score-0.103]
3 28/29, 10587 Berlin, Germany Vojtěch Franc XFRANCV @ CMP. [sent-26, score-0.036]
4 It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. [sent-30, score-0.095]
5 We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. [sent-32, score-0.154]
6 With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. [sent-33, score-0.101]
7 SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, and Python, and has a stand-alone command line interface. [sent-34, score-0.114]
8 The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. [sent-35, score-0.031]
9 Introduction With the great advancements of machine learning in the past few years, many new learning algorithms have been proposed, implemented in C/C++, and made publicly available. [sent-39, score-0.067]
10 However, there currently exist, for example, more than 20 different publicly available implementations of Support Vector Machine (SVM) solvers. [sent-51, score-0.129]
11 Each one comes with its own interface, a small set of available kernel functions, and unique benefits and drawbacks. [sent-52, score-0.063]
12 There is no single unified way of interfacing with these implementations, even though they all are based on essentially the same methodology of supervised learning. [sent-53, score-0.028]
13 This restrains users from fully taking advantage of the recent developments in machine learning algorithms. [sent-54, score-0.03]
14 This motivated us to develop a machine learning toolbox that provides an easy, unified way for solving certain types of machine learning problems. [sent-55, score-0.078]
15 The result is a toolbox, called SHOGUN, with a focus on large-scale learning using kernel methods and SVMs. [sent-56, score-0.063]
16 It provides a generic interface to 15 SVM implementations, among them SVMlight, LibSVM, GPDT, SVMLin, LibLinear, and OCAS. [sent-57, score-0.115]
17 The SVMs can be easily combined with more than 35 different kernel functions. [sent-58, score-0.063]
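For concreteness, the following is a minimal sketch of how one of the bundled SVMs is typically set up and trained through the modular Python interface. The module layout and class names (RealFeatures, Labels, GaussianKernel, LibSVM) reflect the interface of that era and should be taken as an approximation rather than the authoritative API.

```python
import numpy as np
# Assumed module layout of the modular Python interface; names are approximate.
from shogun.Features import RealFeatures, Labels
from shogun.Kernel import GaussianKernel
from shogun.Classifier import LibSVM

# Toy data: 100 two-dimensional examples (SHOGUN stores one example per column)
# with binary labels in {-1, +1}.
X_train = np.random.randn(2, 100)
y_train = np.where(np.random.randn(100) > 0, 1.0, -1.0)

feats = RealFeatures(X_train)                 # dense real-valued features
kernel = GaussianKernel(feats, feats, 1.0)    # RBF kernel with width 1.0
labels = Labels(y_train)

svm = LibSVM(10.0, kernel, labels)            # C=10; any of the 15 SVMs could be used here
svm.train()
```

Any other kernel from the library could be substituted for the Gaussian kernel without changing the rest of the code.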
18 Moreover, it offers options for using precomputed kernels and allows easy integration of new implementations of kernels. [sent-63, score-0.177]
19 One of SHOGUN’s key features is the combined kernel to construct weighted linear combinations of multiple kernels that may even be defined on different input domains. [sent-64, score-0.149]
20 Also, several Multiple Kernel Learning algorithms based on different regularization strategies are available to optimize the weighting of the kernels (e. [sent-65, score-0.086]
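Conceptually, the combined kernel evaluates a weighted sum K(x, x′) = ∑_j β_j k_j(x, x′) of base kernels, where each k_j may be defined on a different input domain and the weights β_j can be optimized by MKL. The numpy sketch below illustrates this computation on precomputed kernel matrices; it is only an illustration of the idea, not SHOGUN's CombinedKernel implementation.

```python
import numpy as np

def combined_kernel_matrix(kernel_matrices, betas):
    """Weighted linear combination of precomputed base kernel matrices.

    kernel_matrices: list of (n x n) numpy arrays, one per base kernel
                     (each may stem from a different input domain).
    betas:           non-negative kernel weights, e.g. as learned by MKL.
    """
    K = np.zeros_like(kernel_matrices[0])
    for beta, K_j in zip(betas, kernel_matrices):
        K += beta * K_j
    return K

# Example: combine an RBF kernel matrix on real-valued features with a string-kernel
# matrix computed on the same examples (both hypothetical, precomputed n x n arrays):
# K = combined_kernel_matrix([K_rbf, K_string], betas=[0.7, 0.3])
```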
21 In addition to kernel and distance based methods, SHOGUN implements many linear methods and features algorithms to train hidden Markov models. [sent-71, score-0.095]
22 Rätsch and Sonnenburg, 2007; Schweikert et al. [sent-73, score-0.103]
23 The input feature objects can be dense or sparse vectors of strings, integers (8, 16, 32 or 64 bit; signed or unsigned), or floating point numbers (32 or 64 bit), and can be converted into different feature types. [sent-75, score-0.034]
24 Finally, several commonly used performance measures, like accuracy and area under ROC or precision-recall curves, are implemented in SHOGUN. [sent-79, score-0.035]
25 Moreover, whenever possible, we implemented auxiliary routines that allow faster computation of combinations of kernel elements that lead to significant speedup during training (for some SVM implementations, e. [sent-82, score-0.098]
26 This allowed us to use SHOGUN for solving several large-scale learning problems in biological sequence analysis, for example, splice site recognition with up to 50 million example sequences for training (Sonnenburg et al. [sent-87, score-0.076]
27 , 2007; Franc and Sonnenburg, 2009) and transcription start site recognition [sent-88, score-0.095]
28 Complete lists of SVM and kernel implementations together with user and developer documentation are available at http://www. [sent-89, score-0.217]
29 with almost 7 billion test sequences (Sonnenburg et al. [sent-92, score-0.044]
30 SHOGUN’s core functions are encapsulated in a library (libshogun) and are easily accessible and extendible by C++ application developers. [sent-94, score-0.105]
31 What sets SHOGUN apart from many other machine learning toolboxes is that it provides interactive user interfaces to most major scripting languages that are currently used in scientific computing, in particular Python, MATLAB, Octave, R, and a command-line version. [sent-95, score-0.26]
32 All classes and functions are documented and come with over 600 examples, and a tutorial for new users and developers is part of the release. [sent-96, score-0.03]
33 Maintaining high code quality is ensured by a test suite that supports running the algorithms for each interface on predefined inputs in order to detect breakage. [sent-102, score-0.146]
34 Modular, Extendible Object-Oriented Design SHOGUN is implemented in an object-oriented way using C++ as the programming language. [sent-105, score-0.035]
35 All objects inherit from CSGObject, which provides means for garbage collection via reference counting, serialization, and versioning of the object. [sent-106, score-0.062]
36 The implementations of many classes employ templates enabling SHOGUN’s support of many different data types without code duplication. [sent-107, score-0.127]
37 As the source code and user documentation are automatically generated using doxygen from comments written in-place in the header files, the amount of work needed to maintain the documentation is drastically reduced. [sent-108, score-0.15]
38 As an example of SHOGUN's object-oriented design, consider the class CClassifier: From this class, CKernelMachine, for example, is derived and provides basic functions for applying a trained kernel classifier (computing f(x) = ∑_{i=1}^{N} α_i k(x, x_i) + b), thus enabling code re-use whenever possible. [sent-109, score-0.127]
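Written out, the decision function that CKernelMachine encapsulates is easy to state in code. The sketch below is a plain Python/numpy illustration of f(x) = ∑_{i=1}^{N} α_i k(x, x_i) + b, not the actual C++ implementation.

```python
import numpy as np

def kernel_machine_decision(x, support_vectors, alphas, bias, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x, x_i) + b for a trained kernel classifier.

    support_vectors: the N stored training examples x_i
    alphas:          the N learned coefficients alpha_i
    bias:            the offset b
    kernel:          callable k(x, x_i) returning a scalar
    """
    return sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors)) + bias

# Example with a Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 * width)):
gaussian = lambda x, y, width=1.0: np.exp(-np.sum((x - y) ** 2) / (2.0 * width))
# f_value = kernel_machine_decision(x_test, svs, alphas, b, gaussian)
```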
39 Interfaces to Scripting Languages and Applications Built around SHOGUN's core are two types of interfaces: a modular interface that makes use of SWIG (http://www.swig.org) and a static interface. [sent-116, score-0.198]
40 Thanks to SWIG, the modular interface exposes to other languages, such as R, Python, and Octave, the exact same objects that are available from C++, in a modular, object-oriented way. [sent-119, score-0.315]
41 Using so-called typemaps, it is convenient to provide type mappings from the native data types of the interface language to SHOGUN's types. [sent-120, score-0.115]
42 For example, a function void set_features(double* features, int n, int m) can be called directly from Octave with a single matrix argument, such as set_features(randn(3,4)). [sent-121, score-0.072]
43 The variables n and m are then automatically set to the matrix dimensions and together with a data pointer passed to the SHOGUN core. [sent-122, score-0.028]
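The effect of such a typemap can be mimicked from the scripting side: the caller passes a single matrix, and the wrapper derives the data pointer and the dimensions n and m from it before forwarding them to the C++ core. The Python sketch below only illustrates this unpacking; in SHOGUN the actual conversion code is generated by SWIG.

```python
import numpy as np

def set_features(features):
    """Illustrative stand-in for the wrapped C++ call
    void set_features(double* features, int n, int m):
    a single matrix argument is unpacked into a data pointer and its dimensions."""
    matrix = np.ascontiguousarray(features, dtype=np.float64)
    n, m = matrix.shape              # n and m are filled in automatically by the typemap
    data_ptr = matrix.ctypes.data    # address that would be passed on to the SHOGUN core
    print(f"forwarding a {n} x {m} matrix stored at 0x{data_ptr:x}")

set_features(np.random.randn(3, 4))  # mirrors set_features(randn(3,4)) in Octave
```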
44 SHOGUN also provides a static interface with the same structure for all supported platforms, including the command-line interface where inputs are provided as either strings or files. [sent-123, score-0.307]
45 The static interfaces are implemented independently of the target language through the class CSGInterface, which provides abstract functions to deliver or obtain data from any particular platform. [sent-134, score-0.035]
46 A community around SHOGUN is continuously developing, with a growing number of projects building on it (cf. [sent-135, score-0.035]
47 http://mloss.org/software/tags/shogun) and a mailing list with more than 100 subscribed users. [sent-137, score-0.028]
48 By 10/2009, there had been at least 1,100 installations under the Linux distributions Debian and Ubuntu. [sent-138, score-0.066]
49 We will continue to develop SHOGUN and are confident that it is and will continue to be useful, and will make an increasing impact beyond the machine learning community by benefiting diverse applications. [sent-139, score-0.035]
50 Support vector machines and kernels for computational biology. [sent-143, score-0.086]
51 Efficient and accurate ℓp-norm multiple kernel learning. [sent-149, score-0.063]
wordName wordTfidf (topN-words)
[('shogun', 0.73), ('sonnenburg', 0.252), ('interface', 0.115), ('tsch', 0.103), ('python', 0.102), ('behr', 0.1), ('bona', 0.1), ('fabio', 0.1), ('schweikert', 0.1), ('tuebingen', 0.094), ('mpg', 0.089), ('octave', 0.088), ('kernels', 0.086), ('gehl', 0.085), ('franc', 0.083), ('modular', 0.083), ('zien', 0.083), ('christian', 0.079), ('interfaces', 0.079), ('toolbox', 0.078), ('jonas', 0.077), ('binder', 0.077), ('gunnar', 0.071), ('alexander', 0.066), ('atsch', 0.066), ('ehl', 0.066), ('ehr', 0.066), ('enschel', 0.066), ('extendible', 0.066), ('friedrich', 0.066), ('gpdt', 0.066), ('henschel', 0.066), ('idmer', 0.066), ('ien', 0.066), ('inder', 0.066), ('installations', 0.066), ('kloft', 0.066), ('miescher', 0.066), ('ona', 0.066), ('onnenburg', 0.066), ('ranc', 0.066), ('scripting', 0.066), ('trifense', 0.066), ('vojt', 0.066), ('widmer', 0.066), ('implementations', 0.063), ('kernel', 0.063), ('berlin', 0.06), ('gmbh', 0.057), ('languages', 0.054), ('transcription', 0.051), ('svmlight', 0.051), ('swig', 0.051), ('czech', 0.047), ('tu', 0.044), ('ren', 0.044), ('site', 0.044), ('billion', 0.044), ('static', 0.041), ('accessible', 0.039), ('ong', 0.037), ('planck', 0.037), ('germany', 0.037), ('documentation', 0.036), ('int', 0.036), ('strings', 0.036), ('ch', 0.036), ('svms', 0.035), ('implemented', 0.035), ('community', 0.035), ('sebastian', 0.034), ('linux', 0.034), ('eu', 0.034), ('objects', 0.034), ('currently', 0.034), ('enabling', 0.033), ('bingen', 0.033), ('publicly', 0.032), ('million', 0.032), ('hidden', 0.032), ('svm', 0.031), ('uni', 0.031), ('code', 0.031), ('users', 0.03), ('interfacing', 0.028), ('arts', 0.028), ('header', 0.028), ('laskov', 0.028), ('oating', 0.028), ('fer', 0.028), ('precomputed', 0.028), ('inherit', 0.028), ('toolboxes', 0.028), ('pointer', 0.028), ('cygwin', 0.028), ('developer', 0.028), ('doxygen', 0.028), ('mailing', 0.028), ('user', 0.027), ('visibility', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, and Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
2 0.070100315 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
3 0.065889992 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
Author: Carl Edward Rasmussen, Hannes Nickisch
Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series1 models. Predictions of GP models come with a natural confidence measure: predictive error-bars. Although the implementation of the basic principles in the simplest case is straight forward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace’s approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In case of large training data, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed. The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend. ∗. Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 T¨ bingen, Germany. u 1. Note, that here we typically think of GPs with a more general index set than time. ©2010 Carl Edward Rasmussen and Hannes Nickisch. R ASMUSSEN AND N ICKISCH 1. Implementation The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean, likelihood functions as well as inference methods, we offer full compatibility to both Matlab 7.x2 and GNU Octave 3.2.x.3 Special attention has been given to properly disentangle covariance, likelihood and mean hyperparameters. Also, care has been taken to avoid numerical inaccuracies, for example, safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular making its naive inversion numerically unsafe. We handle these situations in a principled way4 such that Cholesky decompositions are computed of well-conditioned matrices only. 
As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples n. Looking at the (growing) body of literature on sparse approximations, this knowledge is still somewhat in flux, and consensus on the best approaches has not yet been reached. We provide stable and modular code checked by an exhaustive suite of test cases. A single function gp.m serves as main interface to the user—it can make inference and predictions and allows the mean, covariance and likelihood function as well as the inference methods to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG)5 and the other one relies on a quasi-Newton scheme.6 ∂ Computing the derivatives w.r.t. hyperparameters ∂θi ln Z with gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m. Our documentation comes in two pieces: a hypertext user documentation7 doc/index.html with examples and code browsing and a technical documentation8 doc/manual.pdf focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get his data analysed, however a power user will consult the pdf document once he wants to include his own mean, covariance, likelihood and inference routines or learn about implementation details. 2. 3. 4. 5. 6. 7. 8. Matlab is available from MathWorks, http://www.mathworks.com/. Octave is available from the Free Software Foundation, http://www.gnu.org/software/octave/. We do not consider the “blind” addition of a “small ridge” to K a principled way. Carl Rasmussen’s code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/. Peter Carbonetto’s wrapper can be found at http://www.cs.ubc.ca/˜pcarbo/lbfgsb-for-matlab.html. Documentation can be found at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html. Technical docs are available at http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf. 3012 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX 2. The GPML Toolbox We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function f ∼ GP (mφ (x), kψ (x, x′ )), consists of a mean function m(x) = E[ f (x)], and a covariance function k(x, x) = E[( f (x) − m(x))( f (x′ ) − m(x′ ))], both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data. 
We generally assume independent observations, that is, input/output pairs (xi , yi ) of f with joint likelihood Pρ (y|f) = ∏n Pρ (yi | f (xi )) factorising over cases. Finally, after specification of the prior and i=1 fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases. 1 2 3 4 5 6 7 % 1) SET UP THE GP : COVARIANCE ; MEAN , LIKELIHOOD , INFERENCE METHOD mf = { ’ meanSum ’ ,{ ’ meanLinear ’, @meanConst }}; a = 2; b = 1; % m(x) = a*x+b cf = { ’ covSEiso ’}; sf = 1; ell = 0.7; % squared exponential covariance funct lf = ’ likLaplace ’; sn = 0.2; % assume Laplace noise with variance sn ˆ2 hyp0 . mean = [a;b ]; hyp0 . cov = log([ ell ; sf ]); hyp0 . lik = log( sn ); % hypers inf = ’ infEP ’; % specify expectation propagation as inference method % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt . hyp ; do 50 CG steps Ncg = 50; [hyp , nlZ ] = minimize ( hyp0 , ’gp ’, -Ncg , inf , mf , cf , lf , X , y ); % 3) PREDICT AT UNKNOWN TEST INPUTS [ymu , ys2 ] = gp (hyp , inf , mf , cf , lf , X , y , Xs ); % test input Xs In line 1, we specify the mean mφ (x) = a⊤ x + b of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function, happens not to exist in the library of mean functions; instead we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): As the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m) it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) m(x) = ∑ j m j (x), products m(x) = ∏ j m j (x), scaled versions m(x) = αm0 (x) and powers m(x) = m0 (x)d . This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings ’meanConst’ or alternatively function handles @meanConst. The order of components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation m = mφ (X) and first derivative ∂ computation mi = ∂φi mφ (X) on a data set X. In the same spirit, the squared exponential covariance kψ (x, x′ ) = σ f ² exp(− x − x′ 2 /2ℓ2 ) (cov/covSEiso.m) with hyperparameters ψ = {ln ℓ, ln σ f } is set up in line 2. Note, that the hyperparameters are represented by the logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox. Among others, we offer linear, constant, Mat´ rn, rational quadratic, polynomial, periodic, neural network and finite support coe variance functions. Composite covariance functions allow for sums k(x, x′ ) = ∑ j k j (x, x′ ), products k(x, x′ ) = ∏ j k j (x, x′ ), positive scaling k(x, x′ ) = σ2 k0 (x, x′ ) and masking of components f k(x, x′ ) = k0 (xI , x′ ) with I ⊆ [1, 2, .., D], x ∈ RD . Again, the interface is simple since only the I ∂ evaluation of the covariance matrix K = kψ (X) and its derivatives ∂i K = ∂ψi kψ (X) on a data set X are required. Furthermore, we need cross terms k∗ = kψ (X, x∗ ) and k∗∗ = kψ (x∗ , x∗ ) for prediction. There are no restrictions on the composition of both mean and covariance functions—any combination is allowed including nested composition. 
3013 R ASMUSSEN AND N ICKISCH √ √ The Laplace (lik/likLaplace.m) likelihood Pρ (y| f ) = exp(− 2/σn |y − f |)/ 2σn with hyperparameters ρ = {ln σn } is specified in line 3. There are only simple likelihood functions: Gaussian, Sech-squared, Laplacian and Student’s t for ordinary and sparse regression as well as the error and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments. All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value θ0 ≡ hyp0 in line 6 allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood ZEP (θ) as done by gp.m. Here, D = (X, y) ≡ (X,y) is the training data where X = {x1 , .., xn } and y ∈ Rn . Under the hood, gp.m computes in every step a Gaussian ∂ posterior approximation and the derivatives ∂θ ln ZEP (θ) of the marginal likelihood by calling EP. Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs X∗ ≡ Xs as additional argument. As a result, we obtain the approximate marginal predictive mean E[P(y∗ |D , X∗ )] ≡ ymu and the predictive variance V[P(y∗ |D , X∗ )] ≡ ys2. Likelihood \ Inference Gaussian Sech-squared Laplacian Student’s t Error function Logistic function Exact FITC EP Laplace VB Type, Output Domain regression, R regression, R regression, R regression, R classification, {±1} classification, {±1} Alternate Name logistic distribution double exponential probit regression logit regression Table 1: Likelihood ↔ inference compatibility in the GPML toolbox Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student’s t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for Laplace likelihoods is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and LA. If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood where ˜ instead of the exact covariance matrix K, a low-rank plus diagonal matrix K = Q + diag(K − Q) ⊤ K−1 K is used. The matrices K and K contain covariances and cross-covariances where Q = Ku uu u uu u of and between inducing inputs ui and data points x j . Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n. Acknowledgments Thanks to Ed Snelson for assisting with the FITC approximation. 3014 G AUSSIAN P ROCESSES FOR M ACHINE L EARNING T OOLBOX References Joaquin Qui˜ onero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gausn sian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005. 
Mark N. Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000. Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369. Morgan Kaufmann, 2001. Hannes Nickisch and Carl E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 10 2008. Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006. Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2006. Christopher K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(20):1342–1351, 1998. 3015
4 0.051425599 15 jmlr-2010-Approximate Tree Kernels
Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels
5 0.05047629 93 jmlr-2010-PyBrain
Author: Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas Rückstieß, Jürgen Schmidhuber
Abstract: PyBrain is a versatile machine learning library for Python. Its goal is to provide flexible, easyto-use yet still powerful algorithms for machine learning tasks, including a variety of predefined environments and benchmarks to test and compare algorithms. Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks. Keywords: Python, neural networks, reinforcement learning, optimization
6 0.048533257 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
7 0.048122589 44 jmlr-2010-Graph Kernels
8 0.047827169 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
9 0.047287121 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
10 0.037906744 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
11 0.033643283 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
12 0.031857457 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
13 0.030453678 40 jmlr-2010-Fast and Scalable Local Kernel Machines
14 0.026865833 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
15 0.026429724 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
16 0.023600962 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
17 0.022987727 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
18 0.022748617 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
19 0.022077521 48 jmlr-2010-How to Explain Individual Classification Decisions
20 0.02094109 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
topicId topicWeight
[(0, -0.094), (1, 0.012), (2, -0.061), (3, 0.036), (4, 0.016), (5, 0.066), (6, -0.119), (7, -0.13), (8, -0.088), (9, 0.061), (10, 0.016), (11, -0.157), (12, -0.08), (13, -0.001), (14, 0.053), (15, 0.102), (16, 0.037), (17, -0.076), (18, 0.051), (19, -0.085), (20, -0.116), (21, 0.04), (22, 0.07), (23, -0.114), (24, -0.07), (25, -0.002), (26, -0.165), (27, 0.064), (28, 0.124), (29, -0.023), (30, 0.02), (31, -0.043), (32, -0.026), (33, -0.077), (34, 0.011), (35, 0.073), (36, -0.095), (37, 0.237), (38, -0.179), (39, 0.069), (40, -0.02), (41, -0.229), (42, 0.07), (43, 0.042), (44, 0.152), (45, 0.126), (46, -0.038), (47, 0.104), (48, 0.042), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.95624584 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, and Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
2 0.60857248 93 jmlr-2010-PyBrain
Author: Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas Rückstieß, Jürgen Schmidhuber
Abstract: PyBrain is a versatile machine learning library for Python. Its goal is to provide flexible, easyto-use yet still powerful algorithms for machine learning tasks, including a variety of predefined environments and benchmarks to test and compare algorithms. Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks. Keywords: Python, neural networks, reinforcement learning, optimization
3 0.37621635 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
Author: Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene, Karel Crombecq
Abstract: An exceedingly large number of scientific and engineering fields are confronted with the need for computer simulations to study complex, real world phenomena or solve challenging design problems. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques have become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, prototyping, and sensitivity analysis. Consequently, in many fields there is great interest in tools and techniques that facilitate the construction of such regression models, while minimizing the computational cost and maximizing model accuracy. This paper presents a mature, flexible, and adaptive machine learning toolkit for regression modeling and active learning to tackle these issues. The toolkit brings together algorithms for data fitting, model selection, sample selection (active learning), hyperparameter optimization, and distributed computing in order to empower a domain expert to efficiently generate an accurate model for the problem or data at hand. Keywords: surrogate modeling, metamodeling, function approximation, model selection, adaptive sampling, active learning, distributed computing 1. Background and Motivation In many science and engineering problems researchers make heavy use of computer simulation codes in order to replace expensive physical experiments and improve the quality and performance of engineered products and devices. Such simulation activities are collectively referred to as computational science/engineering. Unfortunately, while allowing scientists more flexibility to study phenomena under controlled conditions, computer simulations require a substantial investment of c 2010 Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene and Karel Crombecq. G ORISSEN , C OUCKUYT, D EMEESTER , D HAENE AND C ROMBECQ computation time. One simulation may take many minutes, hours, days or even weeks, quickly rendering parameter studies impractical (Forrester et al., 2008; Simpson et al., 2008). Of the different ways to deal with this problem, this paper is concerned with the construction of simpler approximation models to predict the system performance and develop a relationship between the system inputs and outputs. When properly constructed, these approximation models mimic the behavior of the simulation accurately while being computationally cheap(er) to evaluate. Different approximation methods exist, each with their relative merits. This work concentrates on the use of data-driven, global approximations using compact surrogate models (also known as metamodels, replacement models, or response surface models). Examples include: rational functions, Kriging models, Artificial Neural Networks (ANN), splines, and Support Vector Machines (SVM). Once such a global approximation is available it is of great use for gaining insight into the behavior of the underlying system. The surrogate may be easily queried, optimized, visualized, and seamlessly integrated into CAD/CAE software packages. The challenge is thus how to generate an approximation model that is as accurate as possible over the complete domain of interest while minimizing the simulation cost. 
Solving this challenge involves multiple sub-problems that must be addressed: how to interface with the simulation code, how to run simulations (locally, or on a cluster or cloud), which model type to approximate the data with and how to set the model complexity (e.g., topology of a neural network), how to estimate the model quality and ensure the domain expert trusts the model, how to decide which simulations to run (data collection), etc. The data collection aspect is worth emphasizing. Since data is computationally expensive to obtain and the optimal data distribution is not known up front, data points should be selected iteratively, there where the information gain will be the greatest. A sampling function is needed that minimizes the number of sample points selected in each iteration, yet maximizes the information gain of each iteration step. This process is called adaptive sampling but is also known as active learning, or sequential design. There is a complex dependency web between these different options and dealing with these dependencies is non-trivial, particularly for a domain expert for whom the surrogate model is just an intermediate step towards solving a larger, more important problem. Few domain experts will be experts in the intricacies of efficient sampling and modeling strategies. Their primary concern is obtaining an accurate replacement metamodel for their problem as fast as possible and with minimal overhead (Gorissen et al., 2009d). As a result these choices are often made in a pragmatic, sometimes even ad-hoc, manner. This paper discusses an advanced, and integrated software framework that provides a flexible and rigorous means to tackle such problems. This work lies at the intersection of Machine Learning/AI, Modeling and Simulation, and Distributed Computing. The methods developed are applicable to any domain where a cheap, accurate, approximation is needed to replace some expensive reference model. Our experience has been that the availability of such a framework can facilitate the transfer of knowledge from surrogate modeling researchers and lower the barrier of entry for domain experts. 2. SUMO Toolbox The platform in question is the Matlab SUrrogate MOdeling (SUMO) Toolbox, illustrated in Figure 1. Given a simulation engine (Fluent, Cadence, Abaqus, HFSS, etc.) or other data source (data 2052 A S URROGATE M ODELING AND A DAPTIVE S AMPLING TOOLBOX FOR C OMPUTER BASED D ESIGN Figure 1: The SUMO Toolbox is a flexible framework for accurate global surrogate modeling and adaptive sampling (active learning). It features a rich set of plugins, is applicable to a wide range of domains, and can be applied in an autonomous, black-box fashion, or under full manual control. Written in Matlab and Java it is fully cross platform and comes with a large (60+) number of example problems. set, Matlab script, Java class, etc.), the toolbox drives the data source to produce a surrogate model within the time and accuracy constraints set by the user. 
The SUMO Toolbox adopts a microkernel design philosophy with many different plugins available for each of the different sub-problems:1 model types (rational functions, Kriging, splines, SVM, ANN, etc.), hyperparameter optimization algorithms (Particle Swarm Optimization, Efficient Global Optimization, simulated annealing, Genetic Algorithm, etc.), model selection algorithms (cross validation, AIC, Leave-out set, etc.), sample selection (random, error based, density based, hybrid, etc.), Design of Experiments (Latin hypercube, Box-Bhenken, etc.), and sample evaluation methods (local, on a cluster or grid). The behavior of each software component is configurable through a central XML file and components can easily be added, removed or replaced by custom implementations. In addition the toolbox provides ‘meta’ plugins. For example to automatically select the best model type for a given problem (Gorissen et al., 2009d) or to use multiple model selection or sample selection criteria in concert (Gorissen et al., 2010). Furthermore, there is built-in support for high performance computing. On the modeling side, the model generation process can take full advantage of multi-core CPUs and even of a complete cluster or grid. This can result in significant speedups for model types where the fitting process can be expensive (e.g., neural networks). Likewise, sample evaluation (simulation) can occur locally (with the option to take advantage of multi-core architectures) or on a separate compute cluster or grid (possibly accessed through a remote head-node). All interfacing with the grid middleware 1. The full list of plugins and features can be found at http://www.sumowiki.intec.ugent.be. 2053 G ORISSEN , C OUCKUYT, D EMEESTER , D HAENE AND C ROMBECQ (submission, job monitoring, rescheduling of failed/lost simulation points, etc.) is handled transparently and automatically (see Gorissen et al., 2009c for more details). Also, the sample evaluation component runs in parallel with the other components (non-blocking) and not sequentially. This allows for an optimal use of computational resources. In addition the SUMO Toolbox contains extensive logging and profiling capabilities so that the modeling process can easily be tracked and the modeling decisions understood. Once a final model has been generated, a GUI tool is available to visually explore the model (including derivatives and prediction uncertainty), assess its quality, and export it for use in other software tools. 3. Applications The SUMO Toolbox has already been applied successfully to a very wide range of applications, including RF circuit block modeling (Gorissen et al., 2009b), hydrological modeling (Couckuyt et al., 2009), Electronic Packaging (Zhu and Franzon, 2009), aerodynamic modeling (Gorissen et al., 2009a), process engineering (Stephens et al., 2009), and automotive data modeling (Gorissen et al., 2010). Besides global modeling capabilities, the SUMO Toolbox also includes a powerful optimization framework based on the Efficient Global Optimization framework developed by Jones et al. (1998). As of version 6.1, the toolbox also contains an example of how the framework can also be applied to solve classification problems. In sum, the goal of the toolbox is to fill the void in machine learning software when it comes to the challenging, costly, real-valued, problems faced in computational engineering. 
The toolbox is in use successfully at various institutions and we are continuously refining and extending the set of available plugins as the number of applications increase. Usage instructions, design documentation, and stable releases for all major platforms can be found at http://www.sumo.intec.ugent.be. References I. Couckuyt, D. Gorissen, H. Rouhani, E. Laermans, and T. Dhaene. Evolutionary regression modeling with active learning: An application to rainfall runoff modeling. In International Conference on Adaptive and Natural Computing Algorithms, volume LNCS 5495, pages 548–558, Sep. 2009. A. Forrester, A. Sobester, and A. Keane. Engineering Design Via Surrogate Modelling: A Practical Guide. Wiley, 2008. D. Gorissen, K. Crombecq, I. Couckuyt, and T. Dhaene. Foundations of Computational Intelligence, Volume 1: Learning and Approximation: Theoretical Foundations and Applications, volume 201, chapter Automatic approximation of expensive functions with active learning, pages 35–62. Springer Verlag, Series Studies in Computational Intelligence, 2009a. D. Gorissen, L. De Tommasi, K. Crombecq, and T. Dhaene. Sequential modeling of a low noise amplifier with neural networks and active learning. Neural Computing and Applications, 18(5): 485–494, Jun. 2009b. D. Gorissen, T. Dhaene, P. Demeester, and J. Broeckhove. Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications, chapter Grid enabled surrogate modeling, pages 249–258. IGI Global, May 2009c. 2054 A S URROGATE M ODELING AND A DAPTIVE S AMPLING TOOLBOX FOR C OMPUTER BASED D ESIGN D. Gorissen, T. Dhaene, and F. DeTurck. Evolutionary model type selection for global surrogate modeling. Journal of Machine Learning Research, 10:2039–2078, 2009d. D. Gorissen, I. Couckuyt, E. Laermans, and T. Dhaene. Multiobjective global surrogate modeling,dealing with the 5-percent problem. Engineering with Computers, 26(1):81–89, Jan. 2010. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, Nov. 1998. ISSN 0925-5001. T. W. Simpson, V. Toropov, V. Balabanov, and F. A. C. Viana. Design and analysis of computer experiments in multidisciplinary design optimization: a review of how far we have come or not. In Proceedings of the 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2008 MAO, Victoria, Canada, 2008. D.W. Stephens, D. Gorissen, and T. Dhaene. Surrogate based sensitivity analysis of process equipment. In Proc. of 7th International Conference on CFD in the Minerals and Process Industries, CSIRO, Melbourne, Australia, Dec. 2009. T. Zhu and P. D. Franzon. Application of surrogate modeling to generate compact and PVT-sensitive IBIS models. In Proceedings of the 18th Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), Oct. 2009. 2055
4 0.35938936 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
5 0.2976509 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
Author: Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten
Abstract: WEKA is a popular machine learning workbench with a development life of nearly two decades. This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software’s functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project. Keywords: machine learning software, open source software
6 0.2408234 15 jmlr-2010-Approximate Tree Kernels
7 0.23787296 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
8 0.22596888 35 jmlr-2010-Error-Correcting Output Codes Library
9 0.22571424 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
10 0.2098015 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
11 0.20486777 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
12 0.18812546 44 jmlr-2010-Graph Kernels
13 0.18770613 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
14 0.18417953 37 jmlr-2010-Evolving Static Representations for Task Transfer
15 0.17853849 100 jmlr-2010-SFO: A Toolbox for Submodular Function Optimization
16 0.17435616 39 jmlr-2010-FastInf: An Efficient Approximate Inference Library
17 0.1697308 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
18 0.16012935 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
19 0.13925302 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
20 0.13417242 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
topicId topicWeight
[(21, 0.015), (24, 0.013), (25, 0.015), (32, 0.023), (36, 0.011), (37, 0.466), (56, 0.032), (75, 0.127), (85, 0.025), (94, 0.142), (96, 0.021)]
simIndex simValue paperId paperTitle
1 0.94916934 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
Author: Carl Edward Rasmussen, Hannes Nickisch
Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series1 models. Predictions of GP models come with a natural confidence measure: predictive error-bars. Although the implementation of the basic principles in the simplest case is straight forward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace’s approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In case of large training data, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed. The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend. ∗. Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 T¨ bingen, Germany. u 1. Note, that here we typically think of GPs with a more general index set than time. ©2010 Carl Edward Rasmussen and Hannes Nickisch. R ASMUSSEN AND N ICKISCH 1. Implementation The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean, likelihood functions as well as inference methods, we offer full compatibility to both Matlab 7.x2 and GNU Octave 3.2.x.3 Special attention has been given to properly disentangle covariance, likelihood and mean hyperparameters. Also, care has been taken to avoid numerical inaccuracies, for example, safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular making its naive inversion numerically unsafe. We handle these situations in a principled way4 such that Cholesky decompositions are computed of well-conditioned matrices only. 
As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters. The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques for dealing with large numbers of training examples n: the (growing) body of literature on sparse approximations is still somewhat in flux, and consensus on the best approaches has not yet been reached.
We provide stable and modular code checked by an exhaustive suite of test cases. A single function, gp.m, serves as the main interface to the user: it can perform inference and prediction, and it allows the mean, covariance and likelihood functions as well as the inference method to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG; Carl Rasmussen's minimize code, http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/), the other relies on a quasi-Newton scheme (Peter Carbonetto's LBFGS-B wrapper, http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html). Computing the derivatives of ln Z with respect to the hyperparameters, ∂ ln Z/∂θ_i, with gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m.
Our documentation comes in two pieces: a hypertext user documentation (doc/index.html, http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html) with examples and code browsing, and a technical documentation (doc/manual.pdf, http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf) focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get their data analysed, whereas a power user will consult the pdf document when they want to include their own mean, covariance, likelihood and inference routines or to learn about implementation details.
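To make the gp.m interface just described concrete, the following minimal sketch shows GP regression with a Gaussian likelihood and exact inference. It is an illustration under assumptions, not taken verbatim from the paper: the names meanZero, likGauss and infExact are our guesses at the corresponding library files (only gp.m and util/minimize.m are named explicitly above), and X, y and Xs stand for hypothetical training inputs, training targets and test inputs.

% Minimal GP regression sketch (assumed GPML-style interface)
mf  = {@meanZero};            % zero mean function, no hyperparameters
cf  = {@covSEiso};            % isotropic squared exponential covariance
lf  = @likGauss;              % Gaussian likelihood
im  = @infExact;              % exact inference (Gaussian likelihood only)
hyp0.mean = [];               % meanZero takes no hyperparameters
hyp0.cov  = log([1.0; 1.0]);  % [log(ell); log(sf)]
hyp0.lik  = log(0.1);         % log of the noise standard deviation
% learn hyperparameters by maximising ln Z (at most 100 function evaluations)
hyp = minimize(hyp0, @gp, -100, im, mf, cf, lf, X, y);
% predictive mean and variance at the test inputs
[ymu, ys2] = gp(hyp, im, mf, cf, lf, X, y, Xs);

Everything here follows the pattern of the worked example in Section 2 below; only the likelihood and the inference method differ.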
2. The GPML Toolbox
We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function f ~ GP(m_φ(x), k_ψ(x, x′)) consists of a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))], both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data. We generally assume independent observations, that is, input/output pairs (x_i, y_i) of f with joint likelihood P_ρ(y|f) = ∏_{i=1}^n P_ρ(y_i | f(x_i)) factorising over cases. Finally, after specification of the prior and fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases.

  % 1) SET UP THE GP: COVARIANCE; MEAN, LIKELIHOOD, INFERENCE METHOD
1 mf = {'meanSum',{'meanLinear',@meanConst}}; a = 2; b = 1;         % m(x) = a*x+b
2 cf = {'covSEiso'}; sf = 1; ell = 0.7;                             % squared exponential covariance function
3 lf = 'likLaplace'; sn = 0.2;                                      % assume Laplace noise with variance sn^2
4 hyp0.mean = [a;b]; hyp0.cov = log([ell;sf]); hyp0.lik = log(sn);  % hyperparameters
5 inf = 'infEP';                                                    % specify expectation propagation as inference method
  % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt hyp; do 50 CG steps
6 Ncg = 50; [hyp, nlZ] = minimize(hyp0, 'gp', -Ncg, inf, mf, cf, lf, X, y);
  % 3) PREDICT AT UNKNOWN TEST INPUTS
7 [ymu, ys2] = gp(hyp, inf, mf, cf, lf, X, y, Xs);                  % test inputs Xs

In line 1, we specify the mean m_φ(x) = a⊤x + b of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function happens not to exist in the library of mean functions; instead we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): as the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m), it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) m(x) = ∑_j m_j(x), products m(x) = ∏_j m_j(x), scaled versions m(x) = α m_0(x) and powers m(x) = m_0(x)^d. This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings ('meanConst') or alternatively as function handles (@meanConst). The order of the components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation m = m_φ(X) and the computation of the first derivatives m_i = ∂m_φ(X)/∂φ_i on a data set X.
In the same spirit, the squared exponential covariance k_ψ(x, x′) = σ_f² exp(−‖x − x′‖²/(2ℓ²)) (cov/covSEiso.m) with hyperparameters ψ = {ln ℓ, ln σ_f} is set up in line 2. Note that the hyperparameters are represented by their logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox; among others, we offer linear, constant, Matérn, rational quadratic, polynomial, periodic, neural network and finite support covariance functions. Composite covariance functions allow for sums k(x, x′) = ∑_j k_j(x, x′), products k(x, x′) = ∏_j k_j(x, x′), positive scaling k(x, x′) = σ_f² k_0(x, x′) and masking of components k(x, x′) = k_0(x_I, x′_I) with I ⊆ [1, 2, .., D], x ∈ R^D. Again, the interface is simple since only the evaluation of the covariance matrix K = k_ψ(X) and its derivatives ∂_i K = ∂k_ψ(X)/∂ψ_i on a data set X are required. Furthermore, we need the cross terms k∗ = k_ψ(X, x∗) and k∗∗ = k_ψ(x∗, x∗) for prediction. There are no restrictions on the composition of both mean and covariance functions: any combination is allowed, including nested composition.
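The composition mechanism can be sketched for covariances in the same cell-array style. This is an assumption-laden illustration: covSum and covNoise are our guesses at the library file names for the sum composite and the independent-noise covariance (the text names only covSEiso explicitly), and the hyperparameter ordering shown simply mirrors the rule stated above for composite mean functions.

% Composite covariance sketch (assumed names):
%   k(x,x') = sf^2 * exp(-||x-x'||^2/(2*ell^2)) + s^2 * delta(x,x')
cf = {'covSum', {'covSEiso', 'covNoise'}};   % sum of two simple covariances
hyp.cov = log([0.7; 1.0; 0.1]);              % ordered as in the cell array:
                                             % covSEiso's [log(ell); log(sf)], then covNoise's log(s)

As with the mean functions, the hyperparameters of a composite covariance are concatenated in the order in which the constituent functions appear in the nested cell array.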
The Laplace likelihood (lik/likLaplace.m) P_ρ(y|f) = exp(−√2 |y − f|/σ_n)/(√2 σ_n) with hyperparameter ρ = {ln σ_n} is specified in line 3. Only simple likelihood functions are provided: Gaussian, Sech-squared, Laplacian and Student's t for ordinary and sparse regression, as well as the error function and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved, as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments.
All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value θ0 ≡ hyp0 in line 6, allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood Z_EP(θ) as computed by gp.m. Here, D = (X, y) ≡ (X,y) is the training data, where X = {x_1, .., x_n} and y ∈ R^n. Under the hood, gp.m computes in every step a Gaussian posterior approximation and the derivatives ∂ ln Z_EP(θ)/∂θ of the marginal likelihood by calling EP. Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs X∗ ≡ Xs as an additional argument. As a result, we obtain the approximate marginal predictive mean E[P(y∗|D, X∗)] ≡ ymu and the predictive variance V[P(y∗|D, X∗)] ≡ ys2.

Likelihood          Type, Output Domain      Alternate Name
Gaussian            regression, R            -
Sech-squared        regression, R            logistic distribution
Laplacian           regression, R            double exponential
Student's t         regression, R            -
Error function      classification, {±1}     probit regression
Logistic function   classification, {±1}     logit regression
Table 1: Likelihood <-> inference compatibility in the GPML toolbox (the original table also marks, for each likelihood, which of the inference methods Exact, FITC, EP, Laplace and VB are supported; these marks are summarised in the text below)

Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student's t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for Laplace likelihoods is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination of the Student's t likelihood and LA.
If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood, where instead of the exact covariance matrix K a low-rank plus diagonal matrix K̃ = Q + diag(K − Q) is used, with Q = K_u⊤ K_uu^{−1} K_u. The matrices K_uu and K_u contain covariances and cross-covariances of and between inducing inputs u_i and data points x_j. Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n.
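To illustrate the FITC route, a hedged sketch follows. The file names infFITC and covFITC are taken from the text; the wrapping convention {covFITC, {base covariance}, u}, the names meanZero and likGauss, and the way the inducing inputs u are chosen are our assumptions about the call signature, and X, y, Xs are hypothetical data.

% FITC sketch for large-n regression with a Gaussian likelihood (assumed call signature)
nu  = 100;                                   % number of inducing inputs
idx = randperm(size(X,1));
u   = X(idx(1:nu), :);                       % pick inducing inputs from the training inputs
cfF = {@covFITC, {{@covSEiso}}, u};          % wrap the base covariance for FITC (assumed nesting)
hyp.cov = log([0.7; 1.0]); hyp.mean = []; hyp.lik = log(0.1);
hyp = minimize(hyp, @gp, -100, @infFITC, {@meanZero}, cfF, @likGauss, X, y);
[ymu, ys2] = gp(hyp, @infFITC, {@meanZero}, cfF, @likGauss, X, y, Xs);

With FITC the dominant cost scales as O(n·nu²) rather than O(n³), which is what makes large n feasible.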
Acknowledgments
Thanks to Ed Snelson for assisting with the FITC approximation.
References
Joaquin Quiñonero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005.
Mark N. Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000.
Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369. Morgan Kaufmann, 2001.
Hannes Nickisch and Carl E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, October 2008.
Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.
Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2006.
Christopher K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
2 0.89394277 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Methods based on ℓ1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2-error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1-relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure
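For reference, the rate quoted in the abstract can be restated in generic Lasso notation; the symbols below (β̂ for the Lasso estimate, β* for the true k-sparse vector) are ours and not taken from the paper:

% Squared l2-error of the Lasso under a restricted eigenvalue condition (generic notation)
\|\hat{\beta} - \beta^{*}\|_{2}^{2} \;\lesssim\; \frac{k \log p}{n},
\qquad k = \|\beta^{*}\|_{0}, \quad X \in \mathbb{R}^{n \times p}.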
same-paper 3 0.88185275 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is , implemented in C++ and interfaces to MATLABTM R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
4 0.8134625 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin
Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization
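For context, the conditional maximum entropy model that iterative scaling and coordinate descent methods fit has the standard log-linear form below; the notation (features f_i, weights θ_i) is generic and ours, not the paper's:

% Conditional Maxent / log-linear model fitted by IS and CD methods
p_{\theta}(y \mid x) \;=\;
  \frac{\exp\big(\sum_i \theta_i f_i(x, y)\big)}
       {\sum_{y'} \exp\big(\sum_i \theta_i f_i(x, y')\big)}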
5 0.81226724 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
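Schematically, and in our own notation rather than the paper's, a generalized expectation criterion augments the supervised objective with a term scoring how well model expectations on unlabeled data match target expectations g̃:

% Schematic GE-augmented objective (notation ours; Delta is a divergence or distance)
\max_{\theta}\; \mathcal{L}(\theta; D_{\text{lab}})
  \;-\; \lambda\, \Delta\Big(\tilde{g},\;
      \tfrac{1}{|D_{\text{unlab}}|}\textstyle\sum_{x \in D_{\text{unlab}}}
      \mathbb{E}_{p_\theta(y \mid x)}\big[g(x, y)\big]\Big)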
6 0.65279692 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
7 0.64699662 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
8 0.55178905 116 jmlr-2010-WEKA−Experiences with a Java Open-Source Project
9 0.52002424 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
10 0.51506317 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.49912056 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
12 0.49675822 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.49330267 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for the ℓq Loss in ℓr Balls
14 0.46738318 109 jmlr-2010-Stochastic Composite Likelihood
15 0.46594721 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
16 0.46279117 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
17 0.46119583 100 jmlr-2010-SFO: A Toolbox for Submodular Function Optimization
18 0.4599978 15 jmlr-2010-Approximate Tree Kernels
19 0.44932199 93 jmlr-2010-PyBrain
20 0.44731972 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide