jmlr jmlr2008 jmlr2008-18 knowledge-graph by maker-knowledge-mining

18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model


Source: pdf

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. [sent-10, score-0.26]

2 We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. [sent-11, score-0.423]

3 Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing. [sent-16, score-0.487]

4 We also address the problem of sparse linear coding of natural images, optimising the codebook by empirical Bayesian marginal likelihood maximisation. [sent-78, score-0.336]

5 Optimal design is discussed in Section 4, and an approximation to the marginal likelihood is given in Section 5. [sent-92, score-0.274]

6 We also introduce the applications of interest in our work here: identification of gene networks, and sparse coding of natural images, and we give remarks about applications to compressive sensing, which are subject to work in progress (Seeger and Nickisch, 2008). [sent-101, score-0.377]

7 A sparsity prior (or regulariser) on the coefficients, for example in the sparse linear model (1) with Laplace prior (2), leads to just that. [sent-111, score-0.294]

8 The posterior for the “very sparse” prior shows shrinkage towards the axes even more strongly, and in terms of enforcing sparsity, this prior is preferable to the Laplacian. [sent-134, score-0.268]

9 , 1999; Peeters and Westra, 2004), the mode â of the posterior P(a|X, u) is found through convex optimisation (recall that the log posterior is concave), and â is treated as posterior estimate of a. [sent-164, score-0.597]
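
The MAP baseline described in this sentence amounts to an L1-penalised least-squares problem. Below is a minimal sketch of that baseline via iterative soft-thresholding (ISTA), not of the paper's Bayesian EP machinery; the function name, fixed iteration count, and step-size rule are illustrative assumptions.

    import numpy as np

    def map_estimate_ista(X, u, sigma2, tau, n_iter=500):
        """Sketch: MAP estimate for the sparse linear model with Laplace prior,
        i.e. argmin_a ||u - X a||^2 / (2 sigma2) + tau * ||a||_1, via ISTA."""
        # Step size from a Lipschitz bound on the gradient of the quadratic term.
        step = sigma2 / (np.linalg.norm(X, 2) ** 2)
        a = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = X.T @ (X @ a - u) / sigma2            # gradient of the data term
            z = a - step * grad
            a = np.sign(z) * np.maximum(np.abs(z) - step * tau, 0.0)  # soft threshold
        return a

Unlike the EP posterior approximation discussed in the rest of this summary, this returns only a point estimate and no uncertainty information.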

10 In contrast, in the Bayesian case, the posterior mass of all exactly sparse a (at least one component exactly zero) is zero, because the posterior has a density w.r.t. Lebesgue measure. [sent-167, score-0.397]

11 The Bayesian approach via the sparse linear model (1) has been suggested by Lewicki and Olshausen (1999), where the average coding cost of images under the model is put forward as criterion for ranking different code matrices X . [sent-224, score-0.261]

12 In a Bayesian nomenclature, the average coding cost is the negative log marginal likelihood − log P(D), where P(D) = ∏_j P(u_j), P(u_j) = ∫ P(u_j | a_j) P(a_j) da_j, and differences of these for different X are log Bayes factors. [sent-226, score-0.345]

13 An approximate Bayesian variant of compressive sensing has been proposed by Ji and Carin (2007), using sparse Bayesian learning (Tipping, 2001) to approximate the inference. [sent-261, score-0.406]

14 Furthermore, the sparsity of a is encoded via a Laplace prior, motivating the sparse linear model for compressive sensing. [sent-269, score-0.319]

15 Another point in which our approach differs from much of the existing work on compressive sensing, has to do with the sparsity prior we employ. [sent-278, score-0.281]

16 Our sparsity prior concentrates on the bi-separation characteristic, without enforcing exact sparseness, thus may be better suited to many compressive sensing applications than the requirement of exact sparsity. [sent-283, score-0.381]

17 Since the likelihood is a Gaussian function of a, the true log posterior log P(a|D) is concave as well, thus has a single mode only. [sent-295, score-0.365]

18 If P(0)(a) := N(u | X a, σ²I) is the Gaussian likelihood (1), the true posterior is P(a|D) ∝ P(0)(a) ∏_i ti(ai), with Laplace sites ti(ai) = exp(−τ̃|ai|). [sent-296, score-0.417]

19 In order to motivate EP, note that an optimal Gaussian posterior approximation Q(a) (at least in our context here) would be obtained by setting its mean and covariance to the true posterior statistics. [sent-299, score-0.335]

20 However, we are able to compute one-dimensional integrals involving a single non-Gaussian site ti (ai ). [sent-301, score-0.285]

21 The EP posterior approximation has the form Q(a) ∝ P(0)(a) ∏_i t̃i(ai), where t̃i(ai | bi, πi) are Gaussian factors. [sent-303, score-0.401]

22 We will see that a good representation has to allow for the rapid “random-access” extraction of marginals Q(ai), and we have to be able to efficiently and robustly update it after a change of bi, πi. [sent-313, score-0.262]

23 The site approximations are t̃i(ai) = N^U(ai | σ⁻²bi, σ⁻²πi), so that Q is a Gaussian. [sent-318, score-0.355]

24 Also, a careful reader may have noted that we are only concerned about the marginal distributions Q(ai) and P̂i(ai) during an EP update at t̃i. [sent-323, score-0.469]

25 It is important to note that EP is not merely a local approximation, in that t̃i is somehow fitted to ti locally, because posterior mean and covariance are shaped jointly. [sent-339, score-0.366]

26 Loosely speaking, the likelihood couples coefficients ai , so that the intentions of the prior factors ti (ai ), namely to force their respective arguments towards zero, have to be weighted against each other in a very non-local procedure. [sent-341, score-0.358]

27 In fact, non-locality is a central aspect of Bayesian inference which makes it so hard to compute, and inference is particularly hard to do in models where strong long-range posterior dependencies are present which cannot easily be predicted from local interactions only. [sent-343, score-0.284]

28 Finally, would it not be much simpler and more efficient to locate the true posterior mode through convex optimisation (recall that the posterior is log-concave), then do a Laplace approximation there, which amounts to expanding the log posterior density to second order around the mode? [sent-344, score-0.632]

29 Therefore, an EP update automatically results in the site approximation t̃i being a (Gaussian) function of ai only. [sent-359, score-0.539]

30 An EP update is local, in that its input is a marginal Q(ai ) and it affects single site parameters bi , πi only. [sent-365, score-0.431]

31 However, we are really interested in the marginal means and variances, which in some cases are only weakly dependent on certain site parameters. [sent-395, score-0.282]

32 Let d(a, b) := |a − b| / max{|a|, |b|, 10⁻³} and ∆i = max{d(hi, hi'), √σ d(ρi, ρi')}, where Q(ai) = N(hi, σ²ρi) and Q'(ai) = N(hi', σ²ρi') are the posterior marginals before and after an update at site i. [sent-397, score-0.483]
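
A minimal sketch of this marginal-change measure (as reconstructed above), and of using it to rank sites for the selective update strategy discussed further below; the function and array names are assumptions, not the paper's code.

    import numpy as np

    def marginal_change(h_old, rho_old, h_new, rho_new, sigma):
        """Relative change of a site's marginal after an update, following
        Delta_i = max{ d(h, h'), sqrt(sigma) * d(rho, rho') } with
        d(a, b) = |a - b| / max(|a|, |b|, 1e-3)."""
        def d(a, b):
            return abs(a - b) / max(abs(a), abs(b), 1e-3)
        return max(d(h_old, h_new), np.sqrt(sigma) * d(rho_old, rho_new))

    def most_promising_sites(deltas, k):
        """Indices of the k sites whose last update changed their marginal most."""
        return np.argsort(np.asarray(deltas))[::-1][:k]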

33 2 Posterior Representation In this section, we develop a representation of the posterior approximation Q(a) = N(h, σ 2 Σ) which allows efficient access to entries of h, diag Σ (marginal moments), and which can be updated robustly and efficiently for single site parameter changes (after EP updates). [sent-400, score-0.46]

34 After an EP update bi → bi', πi → πi', we have that L'(L')^T = L L^T + (πi' − πi) δi δi^T, L'γ' = Lγ + (bi' − bi) δi. [sent-411, score-0.289]
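
This representation change is a rank-one modification of a Cholesky factor. The routine below is a self-contained sketch of the standard rank-one update for the case πi' > πi (the case πi' < πi is a downdate and needs a more careful routine, omitted here); variable names are assumptions.

    import numpy as np

    def chol_rank1_update(L, v):
        """Return a lower-triangular L_new with L_new L_new^T = L L^T + v v^T."""
        L = L.copy()
        v = np.array(v, dtype=float)
        n = L.shape[0]
        for k in range(n):
            r = np.hypot(L[k, k], v[k])
            c, s = r / L[k, k], v[k] / L[k, k]
            L[k, k] = r
            if k + 1 < n:
                L[k+1:, k] = (L[k+1:, k] + s * v[k+1:]) / c
                v[k+1:] = c * v[k+1:] - s * L[k+1:, k]
        return L

    # For the update above: v = sqrt(pi_new - pi_old) * delta_i (i-th unit vector),
    # and the new gamma solves L_new gamma_new = L gamma + (b_new - b_old) * delta_i.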

35 After an EP update bi → bi', πi → πi', the representation is modified as follows. [sent-427, score-0.264]

36 3 The EP Update An EP update works by matching moments between a tilted and the new posterior distribution. [sent-441, score-0.256]

37 For an update at site i, we require the marginal Q(ai ) = N(hi , σ2 ρi ) only, which is obtained from the Q representation. [sent-442, score-0.361]

38 While we cannot offer a firm explanation for this yet, our intuition is that the effect of Laplace prior sites on the posterior is much stronger, trying to emulate the essentially discrete feature selection process in a “unimodal” manner. [sent-459, score-0.319]

39 In the gene network identification application, we ran into problems of numerical instability coming from the combination of Laplace sites with very underdetermined coupling factors P(0) . [sent-461, score-0.255]

40 This is because we divide through the site approximation t̃i(ai), whose πi ≥ κ keeps the variance small. [sent-483, score-0.32]

41 We then try to do an EP update based on a very wide cavity distribution Q\i (ai ) and a quite narrow site ti (ai ) (enforcing a strong sparsity constraint requires a rather large τ). [sent-485, score-0.443]

42 The only difference to standard EP is that we tie the parameters of the corresponding fractional site approximation replicas. [sent-490, score-0.287]

43 We choose the new site parameters bi, πi such that the moments of P̂i and Q match. [sent-495, score-0.274]

44 Fractional updates are easily implemented for sites ti(ai | τ) with some hyperparameter τ, such that ti(ai | τ)^η = ti(ai | ητ). [sent-500, score-0.391]

45 Finally, the site parameters are updated as bi' = (1 − η) bi + b̃i, πi' = (1 − η) πi + π̃i, upon which P̂i and the new Q have the same moments. [sent-508, score-0.343]
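
The sketch below puts one such fractional update together in natural parameters (η = 1 recovers the standard EP update). It is a hedged illustration of the recipe in the surrounding sentences, not the paper's implementation: the helper tilted_moments is a placeholder (for the Laplace site it could be the quadrature sketch given near the end of this summary), and the scaling follows the Q(ai) = N(hi, σ²ρi) parameterisation used above.

    def fractional_site_update(h, rho, b, pi, tilted_moments, sigma, eta=1.0):
        """One fractional EP update at a single site.
        h, rho : current marginal Q(a_i) = N(h, sigma^2 * rho)
        b, pi  : current site parameters (natural parameters, scaled by sigma^-2)
        tilted_moments(mean_cav, var_cav, eta) -> (mean, var) of the tilted
            distribution built from the cavity and the fraction-eta site."""
        # Cavity: remove a fraction eta of the site from the marginal.
        rho_cav = 1.0 / (1.0 / rho - eta * pi)
        h_cav = rho_cav * (h / rho - eta * b)
        # Moment matching against the tilted distribution.
        mean_t, var_t = tilted_moments(h_cav, sigma**2 * rho_cav, eta)
        rho_t = var_t / sigma**2
        # Proposed fractional site parameters, combined with the old ones.
        pi_frac = 1.0 / rho_t - 1.0 / rho_cav
        b_frac = mean_t / rho_t - h_cav / rho_cav
        return (1.0 - eta) * b + b_frac, (1.0 - eta) * pi + pi_frac

With this convention the new posterior marginal matches the tilted moments, consistent with the sentence above.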

46 There, the full standard EP update is computed, but the site parameters are updated to a convex combination of old and proposed new values. [sent-510, score-0.282]

47 Theorem 1 Let EP be applied to a model with true posterior of the form P(a|D) ∝ P(0)(a) ∏_i ti(ai), where P(0)(a) is a joint unnormalised Gaussian factor, and the sites ti(ai) are log-concave. [sent-554, score-0.476]

48 The theorem holds just as well for general sites ti(a) with corresponding site approximations t̃i(a) = N^U(σ⁻²bi, σ⁻²Πi), if “πi ≥ 0” is replaced by “Πi positive semidefinite”. [sent-559, score-0.465]

49 Here, the posterior is a product of independent factors for the rows of A, so that for given (x∗, u∗), the information gain is the sum of D[Q' ‖ Q] over the posterior factors, where (x∗, u∗,j) is appended to D for the j-th factor. [sent-635, score-0.329]
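
For Gaussian posterior approximations this information-gain score has a closed form via the standard Gaussian relative entropy. The expression below is a reminder of that generic formula in the Q = N(h, σ²Σ) parameterisation used in this summary (with n = dim a), not necessarily the exact scoring expression of the paper:

\[
\mathrm{D}[Q' \,\|\, Q] = \tfrac{1}{2}\Big( \operatorname{tr}(\Sigma^{-1}\Sigma') - n + \sigma^{-2}\, (h - h')^{\top} \Sigma^{-1} (h - h') + \log\frac{|\Sigma|}{|\Sigma'|} \Big),
\qquad Q' = N(h', \sigma^2 \Sigma'),\; Q = N(h, \sigma^2 \Sigma).
\]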

50 However, inference schemes such as expectation propagation (and other variational ones) are designed to approximate the posterior marginals well. [sent-673, score-0.404]

51 Recall that in EP, the sites ti are replaced by approximations t̃i of Gaussian form. [sent-703, score-0.326]

52 Earlier on, we did not bother with the normalisation constants of these approximations, but now we have to make them explicit: t̃i(ai) → Ci t̃i(ai), t̃i(ai) = N^U(ai | σ⁻²bi, σ⁻²πi). [sent-704, score-0.286]

53 Furthermore, if α is a parameter of the site ti independent of P(0), then ∂L/∂α = ∂ log Zi/∂α = E_P̂i[ ∂/∂α log ti(ai) ]. [sent-715, score-0.348]
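
As a worked instance of this identity, assume the Laplace site is written with its prefactor, ti(ai) = (τ/2) exp(−(τ/σ)|ai|), and that τ is shared by all n sites (both assumptions match the conventions stated elsewhere in this summary); then

\[
\frac{\partial \log t_i(a_i)}{\partial \tau} = \frac{1}{\tau} - \frac{|a_i|}{\sigma}
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial \tau} = \sum_i \mathrm{E}_{\hat P_i}\!\left[\frac{1}{\tau} - \frac{|a_i|}{\sigma}\right]
= \frac{n}{\tau} - \frac{1}{\sigma}\sum_i \mathrm{E}_{\hat P_i}\big[|a_i|\big],
\]

so the hyperparameter τ can be tuned by simple gradient steps on the EP marginal likelihood approximation.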

54 It is well known that ∇θ(0) log P(u) = EP(a|D) [∇θ(0) log P(0) (a)], from which (7) is obtained by replacing the true posterior by the EP approximation Q(a): the true gradient of the approximate criterion is the approximate gradient of the true criterion. [sent-719, score-0.377]

55 The dominant computation within our framework, both for experimental design and marginal likelihood maximisation, is spent performing a sweep of EP updates. [sent-739, score-0.272]

56 For each site i, the marginal Q(a i ) has to be determined, and the representation for doing so has to be updated afterwards. [sent-741, score-0.353]

57 A key problem is how to efficiently detect the sites whose update would change the current posterior the most. [sent-744, score-0.339]

58 For an update at site i, we require Q(ai ) = N(hi , σ2 ρi ). [sent-773, score-0.256]

59 For example, in the context of experimental design, it is not affordable to update each site after each new data point inclusion. [sent-787, score-0.256]

60 It is reasonable to assume that a small impact of an EP update on the marginal Q(ai ) implies that the whole posterior Q changes little, so site i need not be updated at the moment. [sent-793, score-0.537]

61 In order to direct EP updates towards sites with maximum marginal impact, it is necessary to keep all marginals Q(ai ) up-to-date at all times. [sent-794, score-0.321]

62 Experiments In this section, we present experiments for gene regulatory network identification, for sparse coding of natural images, and for compressive sensing. [sent-843, score-0.441]

63 1 Regulatory Network Identification Our application of experimental design to gene network identification, using the sparse linear model, has been described in Section 2. [sent-845, score-0.26]

64 In contrast to this, we follow the hypothesis of Lewicki and Olshausen (1999) and learn X by maximising the EP approximation to the log marginal likelihood log P(D|X ) (see Section 5). [sent-968, score-0.317]

65 Let φ := − ∑ j∈B L j be the EP approximation to this criterion, evaluated over a batch of size |B| (here, L j is the EP log marginal likelihood approximation on image j). [sent-978, score-0.323]

66 It is easy to see that the gradient of the exact log marginal likelihood is ∇_X log P(u_j | X) = σ⁻² E_P(a_j | u_j, X)[e_j a_j^T] = σ⁻² ( (u_j − X E[a_j]) E[a_j]^T − X Cov[a_j] ), where e_j := u_j − X a_j. [sent-989, score-0.282]
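
A minimal sketch of one gradient-ascent step on the codebook X using this expression, with the EP posterior moments plugged in for E[a_j] and Cov[a_j]; the plain learning rate and the variable names are assumptions, and the paper's actual optimiser and any column normalisation of X are not reproduced here.

    import numpy as np

    def codebook_gradient_step(X, batch_u, batch_mean, batch_cov, sigma2, lr=1e-3):
        """One ascent step on sum_j log P(u_j | X) over a mini-batch.
        batch_u[j]    : observed patch u_j
        batch_mean[j] : EP posterior mean E[a_j]
        batch_cov[j]  : EP posterior covariance Cov[a_j]"""
        grad = np.zeros_like(X)
        for u, m, C in zip(batch_u, batch_mean, batch_cov):
            grad += ((u - X @ m)[:, None] * m[None, :] - X @ C) / sigma2
        return X + lr * grad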

67 In other words, the posterior mean E[a j ] is replaced with the mode, and the second term depending on the posterior covariance Cov[a j ] is neglected altogether. [sent-991, score-0.3]

68 Since the Laplace sparsity prior leads to a posterior which is significantly skewed (towards coordinate axes, see Figure 1), mean and mode tend to be quite different. [sent-992, score-0.326]

69 28 In EP, the posterior expectations are replaced by EQ [·], where Q = N(h, σ2 Σ) is the EP posterior approximation (see Appendix C). [sent-993, score-0.335]

70 Table 1: EP negative log marginal likelihood (EP average coding cost) per image, evaluated on the full test set (50000 cases), for different methods after 10000 batch updates of learning X. [sent-1057, score-0.367]

71 We compare methods in general by evaluating the EP negative log marginal likelihood approximation on the test set, normalised by the number of images. [sent-1066, score-0.254]

72 Learning curves along 10000 batch updates are shown in Figure 3 (using the test subset), and final EP negative log marginal likelihoods per image on the full test set are given in Table 1. [sent-1077, score-0.267]

73 While X could still be learned with τ = 1, the test set log marginal likelihood evaluations for these codes could not be computed for many patches (using EP with η = 1). [sent-1106, score-0.27]

74 Table 2: EP negative log marginal likelihood (EP average coding cost) per image, evaluated on the full test set (50000 cases), where X has been learned by EP. [sent-1189, score-0.302]

75 Apart from learning codes with EP, we can also use the log marginal likelihood approximation of EP in order to compare codes obtained by other methods. [sent-1221, score-0.356]

76 The latter is certainly significantly faster than any of the other methods here, but its suboptimal performance on the same fixed data, and more importantly the lack of an experimental design framework, clearly motivates considering approximate Bayesian inference for compressive sensing as well. [sent-1286, score-0.426]

77 Before we start, we remind the reader that by “approximate inference” method, we mean a technique which delivers a useful approximation of marginal (or even joint) posteriors, and ideally can also be used in order to approximate the marginal likelihood. [sent-1298, score-0.278]

78 ARD works by placing a prior N(ai | 0, σ²πi⁻¹) on ai, where πi is a scale parameter, then maximising the marginal likelihood P(u, π) w.r.t. π. [sent-1306, score-0.355]
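
A minimal EM-style sketch of this ARD/SBL scheme for the linear model (a textbook variant, written with prior variances γi rather than the πi scale parameters above; it is not the specific implementation compared against in the paper):

    import numpy as np

    def ard_em(X, u, sigma2, n_iter=100, gamma_init=1.0):
        """ARD / sparse Bayesian learning with prior a_i ~ N(0, gamma_i) and EM
        updates of the scale parameters gamma."""
        n = X.shape[1]
        gamma = np.full(n, float(gamma_init))
        for _ in range(n_iter):
            # Gaussian posterior over a for the current gamma.
            Sigma = np.linalg.inv(X.T @ X / sigma2 + np.diag(1.0 / gamma))
            mu = Sigma @ X.T @ u / sigma2
            # EM update: gamma_i <- E[a_i^2]; floor it to avoid division by zero.
            gamma = np.maximum(mu**2 + np.diag(Sigma), 1e-12)
        return mu, Sigma, gamma

Many γi are typically driven towards very small values during the iterations, pruning the corresponding coefficients, which is the well-known sparsification behaviour of ARD/SBL.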

79 Variational Mean Field Bayes A direct approach for obtaining an easily computable lower bound on the log marginal likelihood log P(u) works by lower-bounding the sites ti (ai ) by terms of Gaussian form. [sent-1377, score-0.5]
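
One standard construction of such a bound for a Laplace site (sketched here for illustration; the paper's precise bound may be parameterised differently) uses |a| = min_{c>0} (a²/(2c) + c/2):

\[
e^{-\tilde{\tau}|a_i|} \;\ge\; \exp\!\Big( -\,\tilde{\tau}\,\frac{a_i^2}{2 c_i} \;-\; \tilde{\tau}\,\frac{c_i}{2} \Big)
\qquad \text{for all } c_i > 0,
\]

with equality at c_i = |a_i|, so each site is replaced by an unnormalised Gaussian in a_i whose width is governed by the variational parameter c_i, and the c_i are optimised to tighten the resulting bound on log P(u).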

80 Palmer et al. (2006) give the precise relationship between SBL and direct site bounding (called “integral case” and “convex case” there), showing that if ti admits a scale mixture decomposition, it can also be bounded via Legendre duality. [sent-1399, score-0.285]

81 Note that for the same (β, τ), πi is smaller for the SBL update than for the direct site bounding one. [sent-1400, score-0.256]

82 A comparison between approximate inference techniques would be incomplete without including variational mean field Bayes (VMFB) (Attias, 2000; Ghahramani and Beal, 2001), maybe the most well known variational technique at the moment. [sent-1407, score-0.266]

83 org), although we understand this term as encompassing other variational methods for Bayesian inference as well, such as EP, SBL, direct site bounding, and others more. [sent-1410, score-0.327]

84 VMFB for the sparse linear model is equivalent to direct site bounding, as has been shown in Palmer et al. (2006). [sent-1413, score-0.274]

85 Another drawback of MCMC is that while samples of the posterior are obtained, these cannot be used in a simple way in order to obtain a good estimate of the log marginal likelihood log P(u) (see Section 5). [sent-1452, score-0.432]

86 The optimal design capability has been demonstrated for the application of gene regulatory network identification, where the sparsity prior was found to be essential in order to realise very significant gains. [sent-1459, score-0.339]

87 They address the sparse image coding problem with the sparse linear model, but do not consider optimal design applications. [sent-1496, score-0.394]

88 On the other hand, they do not employ the natural EP marginal likelihood approximation we use here, but rather a variational bound. [sent-1501, score-0.274]

89 In this context, the sparse linear model has been proposed as a useful setup, in which codes can be learned by maximising the marginal likelihood. [sent-1508, score-0.253]

90 3 indicate that approximate Bayesian inference and experimental design hold significant promises for compressive sensing, where so far approaches based on L1 penalised estimation and random designs seem to predominate. [sent-1517, score-0.326]

91 Our experiences with the sparse linear model on the image coding problem (or with very underdetermined gene network identification settings) suggest that in some relevant cases, numerical stability issues seem to be inherently present in EP (i. [sent-1521, score-0.359]

92 The preliminary experiments with compressive sensing have been done in joint work with Hannes Nickisch. [sent-1561, score-0.275]

93 The Laplace site is t̃i(a) = exp(−τ̃|a|), τ̃ = τ/σ > 0. [sent-1568, score-0.285]

94 We need to compute moments Ik = E_N(h,ρ)[a^k t̃i(a)], k = 0, 1, 2, where we write a = ai, h = h\i, ρ = ρ\i for simplicity. [sent-1570, score-0.275]

95 Note that Zi = I0 = E_Q\i[ti(ai)] is not required for the EP update itself, but has to be evaluated if an approximation to the marginal likelihood P(D) is sought (see Section 5; recall that Zi as computed here has to be multiplied with the prefactor τ/2 of t̃i which we omitted). [sent-1606, score-0.378]
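
A minimal numerical sketch of these moments by one-dimensional quadrature (the paper evaluates them analytically via the Gaussian error function; the version below trades that for simplicity and is for illustration only). Its output can be fed to the fractional update sketch given earlier, e.g. via lambda m, v, e: laplace_site_moments(m, v, tau_tilde, e)[1:].

    import numpy as np
    from scipy.integrate import quad

    def laplace_site_moments(h, var, tau_tilde, eta=1.0):
        """I_k = E_{N(h, var)}[a^k * exp(-eta * tau_tilde * |a|)] for k = 0, 1, 2,
        returned as (Z, mean, variance) of the tilted distribution."""
        def integrand(a, k):
            gauss = np.exp(-0.5 * (a - h) ** 2 / var) / np.sqrt(2 * np.pi * var)
            return a**k * np.exp(-eta * tau_tilde * abs(a)) * gauss
        I = [quad(integrand, -np.inf, np.inf, args=(k,))[0] for k in range(3)]
        mean = I[1] / I[0]
        return I[0], mean, I[2] / I[0] - mean**2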

96 For an update at site i, we can assume e \i (a ) is a proper Gaussian. [sent-1611, score-0.256]

97 Namely, log Q\i is jointly concave in (ai , h\i ) (being a negative quadratic in ai − h\i ), so that ti (ai )Q\i (ai |h\i ) is log-concave in (ai , h\i ). [sent-1613, score-0.311]

98 Namely, what is referred to as site parameters in Seeger (2005), are in fact the σ−2 bi , σ−2 πi here, not bi , πi . [sent-1683, score-0.317]

99 One could possibly choose another representation of the Laplace sites, whence the equivalence of VMFB and direct site bounding would not hold, but this is not done here. [sent-1707, score-0.254]

100 Experimental design for efficient identification of gene regulatory networks using sparse Bayesian models. [sent-2199, score-0.272]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ep', 0.651), ('site', 0.177), ('olshausen', 0.176), ('laplace', 0.172), ('sbl', 0.154), ('posterior', 0.15), ('compressive', 0.143), ('ai', 0.14), ('bayesian', 0.127), ('eeger', 0.126), ('nference', 0.126), ('sites', 0.11), ('lewicki', 0.109), ('ti', 0.108), ('marginal', 0.105), ('sensing', 0.1), ('sparse', 0.097), ('inear', 0.085), ('parse', 0.085), ('design', 0.083), ('coding', 0.083), ('variational', 0.083), ('sparsity', 0.079), ('update', 0.079), ('seeger', 0.076), ('fractional', 0.075), ('odel', 0.074), ('eq', 0.071), ('bi', 0.07), ('inference', 0.067), ('underdetermined', 0.065), ('updates', 0.065), ('log', 0.063), ('prior', 0.059), ('gaussian', 0.058), ('field', 0.057), ('code', 0.055), ('gene', 0.054), ('tipping', 0.051), ('likelihood', 0.051), ('codes', 0.051), ('mcmc', 0.05), ('rvm', 0.05), ('priors', 0.049), ('degenerate', 0.048), ('palmer', 0.047), ('opper', 0.046), ('optimisation', 0.046), ('representation', 0.045), ('casella', 0.044), ('steinke', 0.044), ('vmfb', 0.044), ('quadrature', 0.042), ('mvms', 0.042), ('winther', 0.042), ('marginals', 0.041), ('maximisation', 0.039), ('wipf', 0.039), ('woodbury', 0.039), ('regulatory', 0.038), ('mode', 0.038), ('hi', 0.036), ('hyperparameters', 0.036), ('approximation', 0.035), ('lters', 0.034), ('minka', 0.034), ('image', 0.034), ('pi', 0.033), ('sweep', 0.033), ('approximate', 0.033), ('llt', 0.033), ('optimised', 0.033), ('cholesky', 0.033), ('mvm', 0.033), ('zi', 0.032), ('done', 0.032), ('propagation', 0.03), ('student', 0.03), ('sequential', 0.03), ('gain', 0.029), ('nu', 0.029), ('park', 0.028), ('carin', 0.028), ('girolami', 0.028), ('nickisch', 0.028), ('candidates', 0.028), ('marginalisation', 0.028), ('renormalisation', 0.028), ('inclusion', 0.027), ('ndings', 0.027), ('robustly', 0.027), ('moments', 0.027), ('updated', 0.026), ('images', 0.026), ('network', 0.026), ('vt', 0.026), ('xt', 0.025), ('numerically', 0.025), ('regularisation', 0.025), ('stable', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

2 0.20990983 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

3 0.13720095 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

4 0.12529896 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

Author: Andreas Christmann, Arnout Van Messem

Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines

5 0.112396 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

Author: Michiel Debruyne, Mia Hubert, Johan A.K. Suykens

Abstract: Recent results about the robustness of kernel methods involve the analysis of influence functions. By definition the influence function is closely related to leave-one-out criteria. In statistical learning, the latter is often used to assess the generalization of a method. In statistics, the influence function is used in a similar way to analyze the statistical efficiency of a method. Links between both worlds are explored. The influence function is related to the first term of a Taylor expansion. Higher order influence functions are calculated. A recursive relation between these terms is found characterizing the full Taylor expansion. It is shown how to evaluate influence functions at a specific sample distribution to obtain an approximation of the leave-one-out error. A specific implementation is proposed using a L1 loss in the selection of the hyperparameters and a Huber loss in the estimation procedure. The parameter in the Huber loss controlling the degree of robustness is optimized as well. The resulting procedure gives good results, even when outliers are present in the data. Keywords: kernel based regression, robustness, stability, influence function, model selection

6 0.10751488 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

7 0.10106616 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

8 0.092963107 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

9 0.091872379 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

10 0.081841528 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

11 0.053887513 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

12 0.052876301 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

13 0.051188257 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure

14 0.048866708 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

15 0.048827346 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding

16 0.047912236 80 jmlr-2008-Ranking Individuals by Group Comparisons

17 0.04581615 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

18 0.045314323 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a GeneralizedFisherCriterion

19 0.045102231 56 jmlr-2008-Magic Moments for Structured Output Prediction

20 0.042916734 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.272), (1, -0.081), (2, -0.221), (3, 0.031), (4, 0.021), (5, 0.202), (6, -0.14), (7, -0.095), (8, -0.085), (9, 0.294), (10, -0.206), (11, -0.031), (12, -0.088), (13, -0.113), (14, -0.047), (15, -0.119), (16, 0.077), (17, -0.008), (18, 0.084), (19, 0.051), (20, -0.264), (21, -0.04), (22, 0.081), (23, -0.075), (24, -0.053), (25, -0.125), (26, -0.068), (27, -0.082), (28, -0.062), (29, 0.068), (30, 0.02), (31, -0.006), (32, -0.101), (33, 0.124), (34, -0.039), (35, -0.084), (36, -0.075), (37, 0.059), (38, -0.102), (39, 0.002), (40, -0.031), (41, 0.054), (42, -0.092), (43, 0.022), (44, 0.034), (45, 0.048), (46, 0.038), (47, 0.002), (48, -0.008), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96906817 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

2 0.60262865 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

3 0.47566625 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

Author: Michiel Debruyne, Mia Hubert, Johan A.K. Suykens

Abstract: Recent results about the robustness of kernel methods involve the analysis of influence functions. By definition the influence function is closely related to leave-one-out criteria. In statistical learning, the latter is often used to assess the generalization of a method. In statistics, the influence function is used in a similar way to analyze the statistical efficiency of a method. Links between both worlds are explored. The influence function is related to the first term of a Taylor expansion. Higher order influence functions are calculated. A recursive relation between these terms is found characterizing the full Taylor expansion. It is shown how to evaluate influence functions at a specific sample distribution to obtain an approximation of the leave-one-out error. A specific implementation is proposed using a L1 loss in the selection of the hyperparameters and a Huber loss in the estimation procedure. The parameter in the Huber loss controlling the degree of robustness is optimized as well. The resulting procedure gives good results, even when outliers are present in the data. Keywords: kernel based regression, robustness, stability, influence function, model selection

4 0.43306825 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

5 0.42497891 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

Author: Andreas Christmann, Arnout Van Messem

Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines

6 0.4127315 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

7 0.40969786 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

8 0.40259463 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

9 0.32313663 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

10 0.30833852 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

11 0.23420496 80 jmlr-2008-Ranking Individuals by Group Comparisons

12 0.22004759 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding

13 0.21976161 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure

14 0.20016129 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a GeneralizedFisherCriterion

15 0.19707139 83 jmlr-2008-Robust Submodular Observation Selection

16 0.19343005 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

17 0.19108741 96 jmlr-2008-Visualizing Data using t-SNE

18 0.18690723 56 jmlr-2008-Magic Moments for Structured Output Prediction

19 0.18217796 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses

20 0.17436416 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (1, 0.013), (5, 0.035), (9, 0.016), (31, 0.015), (40, 0.037), (53, 0.273), (54, 0.039), (58, 0.058), (66, 0.077), (72, 0.015), (76, 0.037), (88, 0.072), (92, 0.044), (94, 0.081), (99, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79592413 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

2 0.53968155 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

3 0.50528085 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

4 0.49206495 83 jmlr-2008-Robust Submodular Observation Selection

Author: Andreas Krause, H. Brendan McMahan, Carlos Guestrin, Anupam Gupta

Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. For example, in environmental monitoring, we want to select locations to measure in order to most effectively predict spatial phenomena. Often, we want to select observations which are robust against a number of possible objective functions. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We show how our algorithm can be extended to handle complex cost functions (incorporating non-unit observation cost or communication and path costs). We also show how the algorithm can be used to near-optimally trade off expected-case (e.g., the Mean Square Prediction Error in Gaussian Process regression) and worst-case (e.g., maximum predictive variance) performance. We show that many important machine learning problems fit our robust submodular observation selection formalism, and provide extensive empirical evaluation on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. c 2008 Andreas Krause, H. Brendan McMahan, Carlos Guestrin and Anupam Gupta. K RAUSE , M C M AHAN , G UESTRIN AND G UPTA Keywords: observation selection, experimental design, active learning, submodular functions, Gaussi

5 0.48859447 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett

Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col

6 0.4872874 86 jmlr-2008-SimpleMKL

7 0.48715147 7 jmlr-2008-A Tutorial on Conformal Prediction

8 0.47976956 56 jmlr-2008-Magic Moments for Structured Output Prediction

9 0.47765476 57 jmlr-2008-Manifold Learning: The Price of Normalization

10 0.47623542 58 jmlr-2008-Max-margin Classification of Data with Absent Features

11 0.47605088 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

12 0.47360477 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

13 0.47346678 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

14 0.47254521 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

15 0.47048014 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

16 0.47004893 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines

17 0.46980321 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

18 0.46697229 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies

19 0.46669284 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

20 0.46309105 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks