jmlr jmlr2010 jmlr2010-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Krause
Abstract: In recent years, a fundamental problem structure has emerged as very useful in a variety of machine learning applications: Submodularity is an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems. We present SFO, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions. A tutorial script illustrates the application of submodularity to machine learning and AI problems such as feature selection, clustering, inference and optimized information gathering.
Reference: text
sentIndex sentText sentNum sentScore
1 Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems. [sent-3, score-0.139]
2 We present SFO, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions. [sent-4, score-0.725]
3 A tutorial script illustrates the application of submodularity to machine learning and AI problems such as feature selection, clustering, inference and optimized information gathering. [sent-5, score-0.17]
4 More formally, they require finding a solution $x^* \in \mathbb{R}^d$ : $x^* = \arg\min g(x)$ s. [sent-8, score-0.017]
5 However, many optimization problems in machine learning, such as feature selection, structure learning and inference in discrete graphical models, require finding solutions to combinatorial optimization problems: They can be reduced to the problem $A^* = \arg\min F(A)$ s. [sent-11, score-0.057]
6 In many machine learning problems, the function F satisfies submodularity, an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. [sent-14, score-0.129]
7 Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems. [sent-16, score-0.139]
8 Interestingly, for submodular functions, guarantees can be obtained both for minimization and for maximization problems. [sent-17, score-0.657]
9 This is important, since applications require both minimization (e. [sent-18, score-0.025]
10 , in clustering, inference and structure learning) and maximization (e. [sent-20, score-0.039]
11 toolbox for use in MATLAB or Octave that implements various algorithms for minimization and maximization of submodular functions. [sent-24, score-0.686]
12 Examples illustrate the application of submodularity to machine learning and AI problems such as clustering (Narasimhan et al. [sent-25, score-0.151]
13 , 2005), inference in graphical models (Kolmogorov and Zabih, 2004) and optimized information gathering (Krause et al. [sent-26, score-0.041]
14 Implementation of Submodular Functions The SFO toolbox includes several examples of submodular functions. [sent-29, score-0.632]
15 Submodular functions are implemented as MATLAB objects, inheriting from sfo_fn. [sent-32, score-0.751]
16 , 2008), structure learning (Narasimhan and Bilmes, 2004) and clustering (Narasimhan et al. [sent-34, score-0.032]
17 Often, algorithms require computing marginal increments $\delta_s^+(A) = F(A \cup \{s\}) - F(A)$ and $\delta_s^-(A) = F(A \setminus \{s\}) - F(A)$, that is, computing the change in submodular value by adding (removing) an element s from a set A. [sent-36, score-0.624]
18 , for mutual information, incrementally computing $F(A \cup \{s\})$ requires up-/downdating of the Cholesky decomposition of covariance matrix $\Sigma_{AA}$. [sent-40, score-0.026]
19 Creating submodular functions from other submodular functions is also possible, using sfo_fn_lincomb for nonnegative linear combinations, and sfo_fn_trunc for truncation. [sent-45, score-2.844]
20 Custom submodular functions can be used either by inheriting from sfo_fn, or by using the sfo_fn_wrapper function, which wraps a pointer to an anonymous function in a submodular function object. [sent-46, score-2.838]
21 The following example wraps an anonymous function fn which computes, for any set of integers A, the number of distinct remainders modulo 5:
    fn = @(A) length(unique(mod(A, 5)));
    F = sfo_fn_wrapper(fn);
    F([1 6])   % returns 1
    F([1:10])  % returns 5
[sent-47, score-1.17]
22 Implemented Algorithms for Submodular Function Optimization SFO implements various algorithms for (constrained) maximization and minimization of submodular functions. [sent-48, score-0.686]
23 Their use is demonstrated in sfo_tutorial and sfo_tutorial_octave. [sent-49, score-1.468]
24 Minimization of Submodular Functions • sfo_min_norm_point: The minimum norm point algorithm of Fujishige (2005) for solving $A^* = \arg\min_{A \subseteq V} F(A)$ for general submodular functions. [sent-50, score-1.316]
25 • sfo_queyranne: Algorithm of Queyranne (1995) solving $A^* = \arg\min_{A \subseteq V : 0 < |A| < |V|} F(A)$ for symmetric submodular functions (i. [sent-51, score-1.316]
26 • sfo_ssp: The submodular-supermodular procedure of Narasimhan and Bilmes (2006) for (heuristically) minimizing the difference between two submodular functions $A^* = \arg\min_{A \subseteq V} F_1(A) - F_2(A)$. [sent-54, score-1.304]
27 • sfo_s_t_min_cut: Solves $A^* = \arg\min_{A \subseteq V} F(A)$ s. [sent-55, score-0.711]
28 • sfo_greedy_splitting: The algorithm of Zhao et al. [sent-58, score-0.748]
29 (2005) for submodular clustering. Maximization of Submodular Functions • sfo_greedy_lazy: The greedy algorithm of Nemhauser et al. [sent-59, score-1.41]
30 (1978) for constrained maximization / coverage, using the lazy evaluation technique of Minoux (1978). [sent-60, score-0.113]
31 • sfo_cover: Greedy coverage algorithm using lazy evaluations. [sent-61, score-0.808]
32 • sfo_celf: The CELF algorithm for approximately solving $A^* = \arg\max_A F(A)$ s. [sent-62, score-0.736]
33 • sfo_ls_lazy: The (deterministic) local search algorithm of Feige et al. [sent-66, score-0.711]
34 (2007) for unconstrained maximization of nonnegative submodular functions, using lazy evaluations. [sent-67, score-0.718]
35 • sfo_pspiel: The pSPIEL algorithm of Krause et al. [sent-68, score-0.711]
36 pSPIEL approximately solves $A^* = \arg\max_A F(A)$ s. [sent-70, score-0.026]
37 • sfo_saturate: The SATURATE algorithm of Krause et al. [sent-73, score-0.711]
38 (2008) for approximately solving the robust optimization problem $A^* = \arg\max_{|A| \le k} \min_i F_i(A)$. [sent-74, score-0.066]
39 • sfo_balance: The eSPASS algorithm for approximately solving the optimization problem $\max_{|A_1 \cup \dots \cup A_k| \le m} \min_i F(A_i)$ (Krause et al. [sent-75, score-0.777]
40 • sfo_max_dca_lazy: The Data Correcting algorithm for maximizing general (not necessarily nondecreasing) submodular functions (Goldengorin et al. [sent-77, score-1.326]
41 The data-correcting algorithm for the minimization of supermodular functions. [sent-96, score-0.042]
42 What energy functions can be minimized via graph cuts? [sent-101, score-0.015]
43 Near-optimal sensor placements: Maximizing information while minimizing communication cost. [sent-108, score-0.033]
44 Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. [sent-114, score-0.073]
45 An analysis of the approximations for maximizing submodular set functions. [sent-156, score-0.615]
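The marginal increments of row 17 and the wrapper example of row 21 can be illustrated with a small Python sketch. This is not SFO's MATLAB interface; the names `distinct_remainders`, `delta_plus` and `delta_minus` are hypothetical and chosen only for illustration. Sets are represented as Python sets, and F is assumed to be any callable that maps a set to a number.

```python
def distinct_remainders(A):
    """Number of distinct remainders modulo 5 among the integers in A."""
    return len({a % 5 for a in A})


def delta_plus(F, A, s):
    """Marginal increment from adding s: F(A with s) - F(A)."""
    A = set(A)
    return F(A | {s}) - F(A)


def delta_minus(F, A, s):
    """Marginal change from removing s: F(A without s) - F(A)."""
    A = set(A)
    return F(A - {s}) - F(A)


F = distinct_remainders
print(F({1, 6}))             # 1 and 6 share the remainder 1
print(F(set(range(1, 11))))  # all five remainders 0..4 occur

# Diminishing returns: adding 7 helps the smaller set at least as much
# as it helps the larger set that contains it.
small, large = {1}, {1, 2}
assert delta_plus(F, small, 7) >= delta_plus(F, large, 7)
```

The final assertion checks the diminishing returns property on one pair of nested sets only; it demonstrates the definition and is not a proof of submodularity.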
wordName wordTfidf (topN-words)
[('sfo', 0.711), ('submodular', 0.593), ('krause', 0.169), ('submodularity', 0.119), ('fn', 0.112), ('narasimhan', 0.106), ('argmina', 0.079), ('xv', 0.075), ('lazy', 0.074), ('argmaxa', 0.04), ('celf', 0.04), ('feige', 0.04), ('goldengorin', 0.04), ('inheriting', 0.04), ('nemhauser', 0.04), ('placements', 0.04), ('queyranne', 0.04), ('rause', 0.04), ('saturate', 0.04), ('spiel', 0.04), ('wraps', 0.04), ('maximization', 0.039), ('toolbox', 0.039), ('greedy', 0.037), ('gaussians', 0.035), ('leskovec', 0.034), ('xa', 0.033), ('sensor', 0.033), ('clustering', 0.032), ('guestrin', 0.03), ('diminishing', 0.03), ('bilmes', 0.03), ('implements', 0.029), ('gathering', 0.028), ('mini', 0.028), ('mutual', 0.026), ('inc', 0.026), ('octave', 0.026), ('minimization', 0.025), ('gupta', 0.025), ('ising', 0.025), ('returns', 0.024), ('coverage', 0.023), ('tutorial', 0.023), ('multivariate', 0.022), ('maximizing', 0.022), ('dec', 0.021), ('kolmogorov', 0.021), ('wrapper', 0.021), ('matlab', 0.021), ('andreas', 0.02), ('provably', 0.02), ('cut', 0.018), ('zhao', 0.017), ('supermodular', 0.017), ('caltech', 0.017), ('pasadena', 0.017), ('sigma', 0.017), ('faloutsos', 0.017), ('pami', 0.017), ('multiway', 0.017), ('extendable', 0.017), ('pointer', 0.017), ('argmin', 0.017), ('stating', 0.017), ('adding', 0.016), ('custom', 0.015), ('script', 0.015), ('increments', 0.015), ('oolbox', 0.015), ('energy', 0.015), ('splitting', 0.014), ('ptimization', 0.014), ('emerged', 0.014), ('cuts', 0.014), ('focs', 0.014), ('corporation', 0.014), ('placement', 0.014), ('modulo', 0.014), ('trans', 0.014), ('combinatorial', 0.014), ('objects', 0.014), ('helps', 0.014), ('solves', 0.013), ('optimized', 0.013), ('optimization', 0.013), ('approximately', 0.013), ('detection', 0.013), ('cache', 0.012), ('scheduling', 0.012), ('lncs', 0.012), ('nonnegative', 0.012), ('solving', 0.012), ('heuristically', 0.012), ('nondecreasing', 0.012), ('correcting', 0.012), ('aa', 0.012), ('intuitive', 0.012), ('nips', 0.012), ('ai', 0.011)]
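The lazy evaluation technique that rows 29 and 30 attribute to Minoux (1978) can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the toolbox's code: `lazy_greedy` and `coverage` are hypothetical names, F is assumed to be monotone submodular, the constraint is a simple cardinality bound k, and ground-set elements are assumed to be mutually comparable so that heap ties can be broken.

```python
import heapq


def lazy_greedy(F, V, k):
    """Pick up to k elements of V, greedily maximizing F with lazy re-evaluation.

    Relies on submodularity: a marginal gain computed in an earlier round
    is an upper bound on the gain in the current round.
    """
    A = []
    current = F(A)
    # Max-heap via negated gains; the third field is the round in which
    # the gain was computed (gains w.r.t. the empty set are fresh in round 1).
    heap = [(-(F([s]) - current), s, 1) for s in V]
    heapq.heapify(heap)
    for round_no in range(1, k + 1):
        while heap:
            neg_gain, s, stamp = heapq.heappop(heap)
            if stamp == round_no:
                # Fresh gain that still beats every (possibly stale) bound.
                A.append(s)
                current += -neg_gain
                break
            # Stale bound: recompute against the current set and push back.
            gain = F(A + [s]) - current
            heapq.heappush(heap, (-gain, s, round_no))
        else:
            break  # no candidates left
    return A


# Toy set-cover instance (made-up data for illustration only).
sets = {1: {"a", "b", "c"}, 2: {"c", "d"}, 3: {"d", "e"}, 4: {"a"}}


def coverage(A):
    """Number of items covered by the sets indexed by A."""
    return len(set().union(*(sets[i] for i in A)))


picked = lazy_greedy(coverage, list(sets), 2)
print(picked, coverage(picked))  # [1, 3] 5
```

The saving comes from the inner loop: an element is re-evaluated only when its stale bound reaches the top of the heap, so most elements are never touched again after the first round.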
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 100 jmlr-2010-SFO: A Toolbox for Submodular Function Optimization
2 0.020492114 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Methods based on ℓ1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared ℓ2-error of a Lasso estimate decays at the minimax optimal rate $\frac{k \log p}{n}$, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of X are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on ℓ1-relaxations to a much broader class of problems than the case of completely independent or unitary designs. Keywords: Lasso, basis pursuit, random matrix theory, Gaussian comparison inequality, concentration of measure
3 0.017625943 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with $O(N^{k+1})$ score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
4 0.016368208 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
Author: Carl Edward Rasmussen, Hannes Nickisch
Abstract: The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks. Keywords: Gaussian processes, nonparametric Bayes, probabilistic regression and classification
Gaussian processes (GPs) (Rasmussen and Williams, 2006) have convenient properties for many modelling tasks in machine learning and statistics. They can be used to specify distributions over functions without having to commit to a specific functional form. Applications range from regression over classification to reinforcement learning, spatial models, survival and other time series [1] models. Predictions of GP models come with a natural confidence measure: predictive error-bars. Although the implementation of the basic principles in the simplest case is straightforward, various complicating features are often desired in practice. For example, a GP is determined by a mean function and a covariance function, but these functions are mostly difficult to specify fully a priori, and typically they are given in terms of hyperparameters, that is, parameters which have to be inferred. Another source of difficulty is the likelihood function. For Gaussian likelihoods, inference is analytically tractable; however, in many tasks, Gaussian likelihoods are not appropriate, and approximate inference methods such as Expectation Propagation (EP) (Minka, 2001), Laplace’s approximation (LA) (Williams and Barber, 1998) and variational bounds (VB) (Gibbs and MacKay, 2000) become necessary (Nickisch and Rasmussen, 2008). In case of large training data, approximations (Candela and Rasmussen, 2005) like FITC (Snelson and Ghahramani, 2006) are needed. The GPML toolbox is designed to overcome these hurdles with its variety of mean, covariance and likelihood functions as well as inference methods, while being simple to use and easy to extend.
∗ Also at Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 Tübingen, Germany.
[1] Note that here we typically think of GPs with a more general index set than time.
©2010 Carl Edward Rasmussen and Hannes Nickisch.
1. Implementation
The GPML toolbox can be obtained from http://gaussianprocess.org/gpml/code/matlab/ and also http://mloss.org/software/view/263/ under the FreeBSD license. Based on simple interfaces for covariance, mean, likelihood functions as well as inference methods, we offer full compatibility to both Matlab 7.x [2] and GNU Octave 3.2.x [3]. Special attention has been given to properly disentangle covariance, likelihood and mean hyperparameters. Also, care has been taken to avoid numerical inaccuracies, for example, safe likelihood evaluations for extreme inputs and stable matrix operations. For example, the covariance matrix K can become numerically close to singular making its naive inversion numerically unsafe. We handle these situations in a principled way [4] such that Cholesky decompositions are computed of well-conditioned matrices only. As a result, our code shows a high level of robustness along the full spectrum of possible hyperparameters.
The focus of the toolbox is on approximate inference using dense matrix algebra. We currently do not support covariance matrix approximation techniques to deal with large numbers of training examples n. Looking at the (growing) body of literature on sparse approximations, this knowledge is still somewhat in flux, and consensus on the best approaches has not yet been reached. We provide stable and modular code checked by an exhaustive suite of test cases. A single function gp.m serves as main interface to the user: it can make inference and predictions and allows the mean, covariance and likelihood function as well as the inference methods to be specified freely. Furthermore, gp.m enables convenient learning of the hyperparameters by maximising the log marginal likelihood ln Z. One of the particularly appealing properties of GP models is that principled and practical approaches exist for learning the parameters of mean, covariance and likelihood functions. Good adaptation of such parameters can be essential to obtain both high quality predictions and insights into the properties of the data. The GPML toolbox is particularly flexible, including a large library of different covariance and mean functions, and flexible ways to combine these into more expressive, specialised functions. The user can choose between two gradient-based optimisers: one uses conjugate gradients (CG) [5] and the other one relies on a quasi-Newton scheme [6]. Computing the derivatives w.r.t. hyperparameters $\frac{\partial}{\partial \theta_i} \ln Z$ with gp.m does not need any extra programming effort; every inference method automatically collects the respective derivatives from the mean, covariance and likelihood functions and passes them to gp.m. Our documentation comes in two pieces: a hypertext user documentation [7] doc/index.html with examples and code browsing and a technical documentation [8] doc/manual.pdf focusing on the interfaces and more technical issues. A casual user will use the hypertext document to quickly get his data analysed; a power user, however, will consult the pdf document once he wants to include his own mean, covariance, likelihood and inference routines or learn about implementation details.
[2] Matlab is available from MathWorks, http://www.mathworks.com/.
[3] Octave is available from the Free Software Foundation, http://www.gnu.org/software/octave/.
[4] We do not consider the “blind” addition of a “small ridge” to K a principled way.
[5] Carl Rasmussen’s code is available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/.
[6] Peter Carbonetto’s wrapper can be found at http://www.cs.ubc.ca/˜pcarbo/lbfgsb-for-matlab.html.
[7] Documentation can be found at http://www.gaussianprocess.org/gpml/code/matlab/doc/index.html.
[8] Technical docs are available at http://www.gaussianprocess.org/gpml/code/matlab/doc/manual.pdf.
2. The GPML Toolbox
We illustrate the modular structure of the GPML toolbox by means of a simple code example. GPs are used to formalise and update knowledge about distributions over functions. A GP prior distribution on an unknown latent function $f \sim \mathcal{GP}(m_\phi(x), k_\psi(x, x'))$ consists of a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$, both of which typically contain hyperparameters φ and ψ, which we want to fit in the light of data. We generally assume independent observations, that is, input/output pairs $(x_i, y_i)$ of f with joint likelihood $P_\rho(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^{n} P_\rho(y_i | f(x_i))$ factorising over cases. Finally, after specification of the prior and fitting of the hyperparameters θ = {φ, ψ, ρ}, we wish to compute predictive distributions for test cases.
     % 1) SET UP THE GP: COVARIANCE; MEAN, LIKELIHOOD, INFERENCE METHOD
  1  mf = {'meanSum',{'meanLinear',@meanConst}}; a = 2; b = 1;        % m(x) = a*x+b
  2  cf = {'covSEiso'}; sf = 1; ell = 0.7;   % squared exponential covariance funct
  3  lf = 'likLaplace'; sn = 0.2;     % assume Laplace noise with variance sn^2
  4  hyp0.mean = [a;b]; hyp0.cov = log([ell;sf]); hyp0.lik = log(sn);   % hypers
  5  inf = 'infEP';     % specify expectation propagation as inference method
     % 2) MINIMISE NEGATIVE LOG MARGINAL LIKELIHOOD nlZ wrt. hyp; do 50 CG steps
  6  Ncg = 50; [hyp,nlZ] = minimize(hyp0,'gp',-Ncg,inf,mf,cf,lf,X,y);
     % 3) PREDICT AT UNKNOWN TEST INPUTS
  7  [ymu,ys2] = gp(hyp,inf,mf,cf,lf,X,y,Xs);                  % test input Xs
In line 1, we specify the mean $m_\phi(x) = a^\top x + b$ of the GP with hyperparameters φ = {a, b}. First, the functional form of the mean function is given and its parameters are initialised. The desired mean function happens not to exist in the library of mean functions; instead we have to make a composite mean function from simple constituents. This is done using a nested cell array containing the algebraic expression for m(x): As the sum of a linear (mean/meanLinear.m) and a constant mean function (mean/meanConst.m) it is an affine function. In addition to linear and constant mean functions, the toolbox offers m(x) = 0 and m(x) = 1. These simple mean functions can be combined by composite mean functions to obtain sums (mean/meanSum.m) $m(x) = \sum_j m_j(x)$, products $m(x) = \prod_j m_j(x)$, scaled versions $m(x) = \alpha m_0(x)$ and powers $m(x) = m_0(x)^d$. This flexible mechanism is used for convenient specification of an extensible algebra of mean functions. Note that functions are referred to either as name strings 'meanConst' or alternatively function handles @meanConst. The order of components of the hyperparameters φ is the same as in the specification of the cell array. Every mean function implements its evaluation $m = m_\phi(X)$ and first derivative computation $m_i = \frac{\partial}{\partial \phi_i} m_\phi(X)$ on a data set X.
In the same spirit, the squared exponential covariance $k_\psi(x, x') = \sigma_f^2 \exp(-\|x - x'\|^2 / 2\ell^2)$ (cov/covSEiso.m) with hyperparameters $\psi = \{\ln \ell, \ln \sigma_f\}$ is set up in line 2. Note that the hyperparameters are represented by the logarithms, as these parameters are naturally positive. Many other simple covariance functions are contained in the toolbox. Among others, we offer linear, constant, Matérn, rational quadratic, polynomial, periodic, neural network and finite support covariance functions. Composite covariance functions allow for sums $k(x, x') = \sum_j k_j(x, x')$, products $k(x, x') = \prod_j k_j(x, x')$, positive scaling $k(x, x') = \sigma_f^2 k_0(x, x')$ and masking of components $k(x, x') = k_0(x_I, x'_I)$ with $I \subseteq [1, 2, .., D]$, $x \in \mathbb{R}^D$. Again, the interface is simple since only the evaluation of the covariance matrix $K = k_\psi(X)$ and its derivatives $\partial_i K = \frac{\partial}{\partial \psi_i} k_\psi(X)$ on a data set X are required. Furthermore, we need cross terms $k_* = k_\psi(X, x_*)$ and $k_{**} = k_\psi(x_*, x_*)$ for prediction. There are no restrictions on the composition of both mean and covariance functions: any combination is allowed including nested composition.
The Laplace (lik/likLaplace.m) likelihood $P_\rho(y|f) = \exp(-\sqrt{2}/\sigma_n \, |y - f|) / (\sqrt{2}\sigma_n)$ with hyperparameters $\rho = \{\ln \sigma_n\}$ is specified in line 3. There are only simple likelihood functions: Gaussian, Sech-squared, Laplacian and Student’s t for ordinary and sparse regression as well as the error and the logistic function for classification. Again, the same inference code is used for any likelihood function. Although the specification of likelihood functions is simple for the user, writing new likelihood functions is slightly more involved as different inference methods require access to different properties; for example, LA requires second derivatives and EP requires derivatives of moments.
All hyperparameters θ = {φ, ψ, ρ} are stored in a struct hyp.{mean,cov,lik}, which is initialised in line 4; we select the approximate inference algorithm EP (inf/infEP.m) in line 5. We optimise the hyperparameters θ ≡ hyp by calling the CG optimiser (util/minimize.m) with initial value $\theta_0$ ≡ hyp0 in line 6 allowing at most N = 50 evaluations of the EP approximation to the marginal likelihood $Z_{EP}(\theta)$ as done by gp.m. Here, $D = (X, \mathbf{y})$ ≡ (X,y) is the training data where $X = \{x_1, .., x_n\}$ and $\mathbf{y} \in \mathbb{R}^n$. Under the hood, gp.m computes in every step a Gaussian posterior approximation and the derivatives $\frac{\partial}{\partial \theta} \ln Z_{EP}(\theta)$ of the marginal likelihood by calling EP. Predictions with optimised hyperparameters are done in line 7, where we call gp.m with the unseen test inputs $X_*$ ≡ Xs as additional argument. As a result, we obtain the approximate marginal predictive mean $E[P(y_* | D, X_*)]$ ≡ ymu and the predictive variance $V[P(y_* | D, X_*)]$ ≡ ys2.
Likelihood \ Inference   Exact  FITC  EP  Laplace  VB   Type, Output Domain     Alternate Name
Gaussian                                                regression, R
Sech-squared                                            regression, R           logistic distribution
Laplacian                                               regression, R           double exponential
Student’s t                                             regression, R
Error function                                          classification, {±1}    probit regression
Logistic function                                       classification, {±1}    logit regression
Table 1: Likelihood ↔ inference compatibility in the GPML toolbox
Table 1 gives the legal likelihood/inference combinations. Exact inference and the FITC approximation support the Gaussian likelihood only. Variational Bayesian (VB) inference is applicable to all likelihoods. Expectation propagation (EP) for the Student’s t likelihood is inherently unstable due to its non-log-concavity. The Laplace approximation (LA) for Laplace likelihoods is not sensible due to the non-differentiable peak of the Laplace likelihood. Special care has been taken for the non-convex optimisation problem imposed by the combination Student’s t likelihood and LA.
If the number of training examples is larger than a few thousand, dense matrix computations become too slow. We provide the FITC approximation for regression with Gaussian likelihood where instead of the exact covariance matrix K, a low-rank plus diagonal matrix $\tilde{K} = Q + \mathrm{diag}(K - Q)$ where $Q = K_u^\top K_{uu}^{-1} K_u$ is used. The matrices $K_{uu}$ and $K_u$ contain covariances and cross-covariances of and between inducing inputs $u_i$ and data points $x_j$. Using inf/infFITC.m together with any covariance function wrapped into cov/covFITC.m makes the computations feasible for large n.
Acknowledgments
Thanks to Ed Snelson for assisting with the FITC approximation.
References
Joaquin Quiñonero Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(6):1935–1959, 2005.
Mark N. Gibbs and David J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464, 2000.
Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369. Morgan Kaufmann, 2001.
Hannes Nickisch and Carl E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.
Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2006.
Christopher K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
5 0.015506107 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
7 0.013950717 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
8 0.013098816 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
9 0.01300592 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
10 0.012008333 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
11 0.011857728 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
12 0.011291983 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
13 0.010790516 118 jmlr-2010-libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
14 0.01077931 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
15 0.010410314 109 jmlr-2010-Stochastic Composite Likelihood
16 0.010346917 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
17 0.0097712716 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
18 0.0097068353 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
19 0.0083380556 69 jmlr-2010-Lp-Nested Symmetric Distributions
20 0.0077821747 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
topicId topicWeight
[(0, -0.04), (1, -0.006), (2, 0.0), (3, -0.006), (4, -0.016), (5, 0.007), (6, 0.004), (7, -0.019), (8, -0.036), (9, 0.035), (10, -0.013), (11, -0.018), (12, -0.004), (13, -0.002), (14, -0.024), (15, -0.013), (16, -0.028), (17, -0.01), (18, 0.003), (19, -0.014), (20, -0.034), (21, -0.001), (22, -0.023), (23, -0.055), (24, -0.001), (25, -0.018), (26, -0.034), (27, -0.055), (28, 0.016), (29, 0.021), (30, -0.001), (31, -0.03), (32, -0.022), (33, -0.09), (34, -0.017), (35, 0.058), (36, -0.185), (37, 0.115), (38, -0.062), (39, 0.053), (40, -0.09), (41, 0.073), (42, -0.006), (43, 0.142), (44, 0.477), (45, -0.627), (46, -0.216), (47, 0.165), (48, -0.033), (49, -0.287)]
simIndex simValue paperId paperTitle
same-paper 1 0.98966128 100 jmlr-2010-SFO: A Toolbox for Submodular Function Optimization
2 0.13722219 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
Author: Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB™, R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org. Keywords: support vector machines, kernels, large-scale learning, Python, Octave, R
3 0.12556425 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
4 0.10277694 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^{k+1}) score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
5 0.096571527 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
Author: Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene, Karel Crombecq
Abstract: An exceedingly large number of scientific and engineering fields are confronted with the need for computer simulations to study complex, real world phenomena or solve challenging design problems. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques has become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, prototyping, and sensitivity analysis. Consequently, in many fields there is great interest in tools and techniques that facilitate the construction of such regression models, while minimizing the computational cost and maximizing model accuracy. This paper presents a mature, flexible, and adaptive machine learning toolkit for regression modeling and active learning to tackle these issues. The toolkit brings together algorithms for data fitting, model selection, sample selection (active learning), hyperparameter optimization, and distributed computing in order to empower a domain expert to efficiently generate an accurate model for the problem or data at hand. Keywords: surrogate modeling, metamodeling, function approximation, model selection, adaptive sampling, active learning, distributed computing
© 2010 Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene and Karel Crombecq.
1. Background and Motivation
In many science and engineering problems researchers make heavy use of computer simulation codes in order to replace expensive physical experiments and improve the quality and performance of engineered products and devices. Such simulation activities are collectively referred to as computational science/engineering. Unfortunately, while allowing scientists more flexibility to study phenomena under controlled conditions, computer simulations require a substantial investment of computation time. One simulation may take many minutes, hours, days or even weeks, quickly rendering parameter studies impractical (Forrester et al., 2008; Simpson et al., 2008). Of the different ways to deal with this problem, this paper is concerned with the construction of simpler approximation models to predict the system performance and develop a relationship between the system inputs and outputs. When properly constructed, these approximation models mimic the behavior of the simulation accurately while being computationally cheap(er) to evaluate. Different approximation methods exist, each with their relative merits. This work concentrates on the use of data-driven, global approximations using compact surrogate models (also known as metamodels, replacement models, or response surface models). Examples include: rational functions, Kriging models, Artificial Neural Networks (ANN), splines, and Support Vector Machines (SVM). Once such a global approximation is available it is of great use for gaining insight into the behavior of the underlying system. The surrogate may be easily queried, optimized, visualized, and seamlessly integrated into CAD/CAE software packages. The challenge is thus how to generate an approximation model that is as accurate as possible over the complete domain of interest while minimizing the simulation cost. Solving this challenge involves multiple sub-problems that must be addressed: how to interface with the simulation code, how to run simulations (locally, or on a cluster or cloud), which model type to approximate the data with and how to set the model complexity (e.g., topology of a neural network), how to estimate the model quality and ensure the domain expert trusts the model, how to decide which simulations to run (data collection), etc. The data collection aspect is worth emphasizing.
Since data is computationally expensive to obtain and the optimal data distribution is not known up front, data points should be selected iteratively, where the information gain will be greatest. A sampling function is needed that minimizes the number of sample points selected in each iteration, yet maximizes the information gain of each iteration step. This process is called adaptive sampling but is also known as active learning, or sequential design. There is a complex dependency web between these different options and dealing with these dependencies is non-trivial, particularly for a domain expert for whom the surrogate model is just an intermediate step towards solving a larger, more important problem. Few domain experts will be experts in the intricacies of efficient sampling and modeling strategies. Their primary concern is obtaining an accurate replacement metamodel for their problem as fast as possible and with minimal overhead (Gorissen et al., 2009d). As a result these choices are often made in a pragmatic, sometimes even ad-hoc, manner. This paper discusses an advanced and integrated software framework that provides a flexible and rigorous means to tackle such problems. This work lies at the intersection of Machine Learning/AI, Modeling and Simulation, and Distributed Computing. The methods developed are applicable to any domain where a cheap, accurate approximation is needed to replace some expensive reference model. Our experience has been that the availability of such a framework can facilitate the transfer of knowledge from surrogate modeling researchers and lower the barrier of entry for domain experts.
2. SUMO Toolbox
The platform in question is the Matlab SUrrogate MOdeling (SUMO) Toolbox, illustrated in Figure 1. Given a simulation engine (Fluent, Cadence, Abaqus, HFSS, etc.) or other data source (data set, Matlab script, Java class, etc.), the toolbox drives the data source to produce a surrogate model within the time and accuracy constraints set by the user.
Figure 1: The SUMO Toolbox is a flexible framework for accurate global surrogate modeling and adaptive sampling (active learning). It features a rich set of plugins, is applicable to a wide range of domains, and can be applied in an autonomous, black-box fashion, or under full manual control. Written in Matlab and Java, it is fully cross platform and comes with a large (60+) number of example problems.
The SUMO Toolbox adopts a microkernel design philosophy with many different plugins available for each of the different sub-problems (see footnote 1): model types (rational functions, Kriging, splines, SVM, ANN, etc.), hyperparameter optimization algorithms (Particle Swarm Optimization, Efficient Global Optimization, simulated annealing, Genetic Algorithm, etc.), model selection algorithms (cross validation, AIC, Leave-out set, etc.), sample selection (random, error based, density based, hybrid, etc.), Design of Experiments (Latin hypercube, Box-Behnken, etc.), and sample evaluation methods (local, on a cluster or grid). The behavior of each software component is configurable through a central XML file and components can easily be added, removed or replaced by custom implementations. In addition the toolbox provides ‘meta’ plugins, for example to automatically select the best model type for a given problem (Gorissen et al., 2009d) or to use multiple model selection or sample selection criteria in concert (Gorissen et al., 2010). Furthermore, there is built-in support for high performance computing. On the modeling side, the model generation process can take full advantage of multi-core CPUs and even of a complete cluster or grid.
This can result in significant speedups for model types where the fitting process can be expensive (e.g., neural networks). Likewise, sample evaluation (simulation) can occur locally (with the option to take advantage of multi-core architectures) or on a separate compute cluster or grid (possibly accessed through a remote head-node). All interfacing with the grid middleware (submission, job monitoring, rescheduling of failed/lost simulation points, etc.) is handled transparently and automatically (see Gorissen et al., 2009c for more details). Also, the sample evaluation component runs in parallel with the other components (non-blocking) and not sequentially. This allows for an optimal use of computational resources. In addition the SUMO Toolbox contains extensive logging and profiling capabilities so that the modeling process can easily be tracked and the modeling decisions understood. Once a final model has been generated, a GUI tool is available to visually explore the model (including derivatives and prediction uncertainty), assess its quality, and export it for use in other software tools.
Footnote 1: The full list of plugins and features can be found at http://www.sumowiki.intec.ugent.be.
3. Applications
The SUMO Toolbox has already been applied successfully to a very wide range of applications, including RF circuit block modeling (Gorissen et al., 2009b), hydrological modeling (Couckuyt et al., 2009), Electronic Packaging (Zhu and Franzon, 2009), aerodynamic modeling (Gorissen et al., 2009a), process engineering (Stephens et al., 2009), and automotive data modeling (Gorissen et al., 2010). Besides global modeling capabilities, the SUMO Toolbox also includes a powerful optimization framework based on the Efficient Global Optimization framework developed by Jones et al. (1998). As of version 6.1, the toolbox also contains an example of how the framework can be applied to solve classification problems. In sum, the goal of the toolbox is to fill the void in machine learning software when it comes to the challenging, costly, real-valued problems faced in computational engineering. The toolbox is in use successfully at various institutions and we are continuously refining and extending the set of available plugins as the number of applications increases. Usage instructions, design documentation, and stable releases for all major platforms can be found at http://www.sumo.intec.ugent.be.
References
I. Couckuyt, D. Gorissen, H. Rouhani, E. Laermans, and T. Dhaene. Evolutionary regression modeling with active learning: An application to rainfall runoff modeling. In International Conference on Adaptive and Natural Computing Algorithms, volume LNCS 5495, pages 548–558, Sep. 2009.
A. Forrester, A. Sobester, and A. Keane. Engineering Design Via Surrogate Modelling: A Practical Guide. Wiley, 2008.
D. Gorissen, K. Crombecq, I. Couckuyt, and T. Dhaene. Foundations of Computational Intelligence, Volume 1: Learning and Approximation: Theoretical Foundations and Applications, volume 201, chapter Automatic approximation of expensive functions with active learning, pages 35–62. Springer Verlag, Series Studies in Computational Intelligence, 2009a.
D. Gorissen, L. De Tommasi, K. Crombecq, and T. Dhaene. Sequential modeling of a low noise amplifier with neural networks and active learning. Neural Computing and Applications, 18(5):485–494, Jun. 2009b.
D. Gorissen, T. Dhaene, P. Demeester, and J. Broeckhove. Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications, chapter Grid enabled surrogate modeling, pages 249–258. IGI Global, May 2009c.
D. Gorissen, T. Dhaene, and F. DeTurck. Evolutionary model type selection for global surrogate modeling. Journal of Machine Learning Research, 10:2039–2078, 2009d.
D. Gorissen, I. Couckuyt, E. Laermans, and T. Dhaene. Multiobjective global surrogate modeling, dealing with the 5-percent problem. Engineering with Computers, 26(1):81–89, Jan. 2010.
D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, Nov. 1998. ISSN 0925-5001.
T. W. Simpson, V. Toropov, V. Balabanov, and F. A. C. Viana. Design and analysis of computer experiments in multidisciplinary design optimization: a review of how far we have come or not. In Proceedings of the 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2008 MAO, Victoria, Canada, 2008.
D. W. Stephens, D. Gorissen, and T. Dhaene. Surrogate based sensitivity analysis of process equipment. In Proc. of 7th International Conference on CFD in the Minerals and Process Industries, CSIRO, Melbourne, Australia, Dec. 2009.
T. Zhu and P. D. Franzon. Application of surrogate modeling to generate compact and PVT-sensitive IBIS models. In Proceedings of the 18th Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), Oct. 2009.
6 0.094406083 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
7 0.082578428 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
8 0.082358353 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
9 0.081037208 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages
10 0.080425121 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
11 0.078491449 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
12 0.063703269 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
13 0.062959626 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
14 0.060765531 41 jmlr-2010-Gaussian Processes for Machine Learning (GPML) Toolbox
15 0.059531048 63 jmlr-2010-Learning Instance-Specific Predictive Models
16 0.058142129 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
17 0.056595474 102 jmlr-2010-Semi-Supervised Novelty Detection
18 0.055635653 94 jmlr-2010-Quadratic Programming Feature Selection
19 0.054481506 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
20 0.053519085 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
topicId topicWeight
[(4, 0.014), (8, 0.025), (15, 0.011), (16, 0.497), (21, 0.019), (32, 0.033), (36, 0.02), (37, 0.062), (75, 0.084), (81, 0.015), (85, 0.032), (96, 0.016), (97, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.62756449 100 jmlr-2010-SFO: A Toolbox for Submodular Function Optimization
Author: Andreas Krause
Abstract: In recent years, a fundamental problem structure has emerged as very useful in a variety of machine learning applications: Submodularity is an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems. We present SFO, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions. A tutorial script illustrates the application of submodularity to machine learning and AI problems such as feature selection, clustering, inference and optimized information gathering.
2 0.22477615 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
Author: Gideon S. Mann, Andrew McCallum
Abstract: In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models. Keywords: generalized expectation criteria, semi-supervised learning, logistic regression, conditional random fields
3 0.22274897 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
Author: Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin
Abstract: Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are one of the most popular approaches to solve Maxent. With many variants of IS methods, it is difficult to understand them and see the differences. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods. Keywords: maximum entropy, iterative scaling, coordinate descent, natural language processing, optimization
4 0.22248004 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
5 0.22134237 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression
Author: Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, Aníbal R. Figueiras-Vidal
Abstract: We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class. Keywords: Gaussian process, probabilistic regression, sparse approximation, power spectrum, computational efficiency
6 0.21957988 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
7 0.21641144 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
8 0.21581787 109 jmlr-2010-Stochastic Composite Likelihood
9 0.21393326 110 jmlr-2010-The SHOGUN Machine Learning Toolbox
10 0.21388318 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
11 0.21371718 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
12 0.21356547 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
13 0.21286985 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
14 0.21190202 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.21152198 63 jmlr-2010-Learning Instance-Specific Predictive Models
16 0.21123096 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
17 0.21020345 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
19 0.20944275 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
20 0.20927091 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
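The diminishing returns property described in the SFO abstract above can be illustrated with a small sketch. The code below is not the SFO toolbox or its API (SFO targets MATLAB/Octave); it is a plain Python illustration under stated assumptions: the set function is a toy coverage function, the names `coverage`, `greedy_maximize` and `SENSORS` are hypothetical, and the selection rule is the standard greedy heuristic for maximizing a monotone submodular function under a cardinality constraint.

```python
def coverage(sets):
    """Build F(A) = number of items covered by the sets indexed by A.

    Coverage functions are a textbook example of monotone submodular functions.
    """
    def F(A):
        covered = set()
        for a in A:
            covered |= sets[a]
        return len(covered)
    return F


def greedy_maximize(F, ground_set, k):
    """Select up to k elements, each time adding the largest marginal gain."""
    selected = []
    for _ in range(k):
        current = F(selected)
        best, best_gain = None, 0
        for e in sorted(ground_set):  # sorted for deterministic tie-breaking
            if e in selected:
                continue
            gain = F(selected + [e]) - current
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no remaining element improves F
            break
        selected.append(best)
    return selected


# Hypothetical toy data: each "sensor" covers a set of locations.
SENSORS = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}

F = coverage(SENSORS)
chosen = greedy_maximize(F, SENSORS.keys(), k=2)

# Diminishing returns: "b" helps more when added to the empty set
# than when added to the larger set ["a"].
gain_small = F(["b"]) - F([])
gain_large = F(["a", "b"]) - F(["a"])
```

For monotone submodular functions under a cardinality constraint, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value, which is the kind of provable near-optimality the abstract alludes to. The quadratic number of function evaluations here is chosen for readability, not efficiency.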