nips nips2013 nips2013-346 knowledge-graph by maker-knowledge-mining

346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression


Source: pdf

Author: Michalis Titsias, Miguel Lazaro-Gredilla

Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. [sent-6, score-0.36]

2 This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. [sent-7, score-0.584]

3 We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. [sent-8, score-0.153]

4 In particular, the commonly used procedure is to find point estimates over the kernel hyperparameters by maximizing the marginal likelihood, which is the likelihood obtained once the latent variables associated with the GP function have been integrated out (Rasmussen and Williams, 2006). [sent-11, score-0.384]
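To make this standard procedure concrete, here is a minimal sketch of type II maximum likelihood for GP regression with an ARD squared exponential kernel, written in NumPy/SciPy; it illustrates the general recipe only, it is not code from the paper, and the function names and toy data are assumptions.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.optimize import minimize

    def ard_se_kernel(X1, X2, log_sf2, log_w):
        # ARD squared exponential: k(x, x') = sf2 * exp(-0.5 * sum_d w_d^2 (x_d - x'_d)^2)
        W = np.exp(log_w)                          # one positive weight per input dimension
        Z1, Z2 = X1 * W, X2 * W
        sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2 * Z1 @ Z2.T
        return np.exp(log_sf2) * np.exp(-0.5 * np.maximum(sq, 0.0))

    def neg_log_marginal_likelihood(params, X, y):
        # The hyperparameters enter non-linearly through the inverse and determinant of K.
        log_sf2, log_s2, log_w = params[0], params[1], params[2:]
        K = ard_se_kernel(X, X, log_sf2, log_w) + np.exp(log_s2) * np.eye(len(y))
        L, lower = cho_factor(K, lower=True)
        alpha = cho_solve((L, lower), y)
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((80, 5))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(80)    # only dimension 0 matters
    res = minimize(neg_log_marginal_likelihood, np.zeros(2 + X.shape[1]),
                   args=(X, y), method="L-BFGS-B")
    print("point estimates of the log-hyperparameters:", res.x)

With many input dimensions the number of weights grows with D, which is exactly the regime in which such point estimates can overfit, as discussed in the following sentences.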

5 Such a procedure provides a practical algorithm that is expected to be robust to overfitting when the number of hyperparameters that need to be tuned is relatively small compared to the amount of data. [sent-12, score-0.196]

6 In contrast, when the number of hyperparameters is large this approach will suffer from the shortcomings of a typical maximum likelihood method such as overfitting. [sent-13, score-0.196]

7 To avoid the above problems, in GP models, the use of kernel functions with few kernel hyperparameters is common practice, although this can lead to limited flexibility when modelling the data. [sent-14, score-0.436]

8 On the other hand, while full Bayesian inference for GP models could be useful, it is pragmatically a very challenging task that has so far been attempted only by using expensive MCMC techniques such as the recent method of Murray and Adams (2010). [sent-16, score-0.064]

9 Deterministic approximations, and particularly the variational Bayes framework, have not been applied so far to the treatment of kernel hyperparameters in GP models. [sent-17, score-0.557]

10 To this end, in this work we introduce a variational method for approximate Bayesian inference over hyperparameters in GP regression models with squared exponential kernel functions. [sent-18, score-0.631]

11 This approach consists of a novel variant of the variational framework introduced in (Titsias and Lawrence, 2010) for the Gaussian process latent variable model. [sent-19, score-0.308]

12 Furthermore, this method uses the concept of a standardised GP process and allows for learning Mahalanobis distance metrics (Weinberger and Saul, 2009; Xing et al. [sent-20, score-0.325]

13 , 2003) in Gaussian process regression settings using Bayesian inference. [sent-21, score-0.075]

14 The remainder of this paper is organised as follows: Section 2 provides the motivation and theoretical foundation of the variational method, and Section 3 demonstrates the method on a number of challenging regression datasets, also providing a comprehensive comparison with existing methods. [sent-23, score-0.298]

15 Section 2.1 discusses Bayesian GP regression and motivates the variational method. [sent-26, score-0.296]

16 Section 2.2 explains the concept of the standardised representation of a GP model that is used by the variational method described in Section 2.3. [sent-28, score-0.492]

17 Section 2.4 discusses setting the prior over the kernel hyperparameters, together with an analytical way to reduce the number of parameters to be optimised during variational inference. [sent-31, score-0.638]

18 2.1 Bayesian GP regression and motivation for the variational method. Suppose we have data $\{y_i, x_i\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^D$ and each $y_i$ is a real-valued scalar output. [sent-35, score-0.287]

19 In GP regression, we assume that each observed output is generated according to $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, where the full-length latent function $f(x)$ is assigned a zero-mean GP prior with a certain covariance or kernel function $k_f(x, x')$ that depends on hyperparameters $\theta$. [sent-37, score-0.644]

20 Throughout the paper we will consider the following squared exponential kernel function: $k_f(x, x') = \sigma_f^2 e^{-\frac{1}{2}(x - x')^T W^T W (x - x')} = \sigma_f^2 e^{-\frac{1}{2}\|Wx - Wx'\|^2} = \sigma_f^2 e^{-\frac{1}{2} d_W^2(x, x')}$, (1) where $d_W(x, x') = \|Wx - Wx'\|$. [sent-38, score-0.292]

21 In the special case where W is a square and diagonal matrix, the above kernel function reduces to $k_f(x, x') = \sigma_f^2 e^{-\frac{1}{2}\sum_{d=1}^{D} w_d^2 (x_d - x_d')^2}$, (2) which is the well-known ARD squared exponential kernel commonly used in GP regression applications (Rasmussen and Williams, 2006). [sent-40, score-0.463]
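A minimal NumPy sketch of the two kernels in eqs. (1) and (2) above; this is an illustrative reimplementation of the stated formulas, not the authors' code, and the function names are assumptions.

    import numpy as np

    def mahalanobis_se_kernel(X1, X2, sf2, W):
        # Eq. (1): k_f(x, x') = sf2 * exp(-0.5 * ||W x - W x'||^2), with W of size K x D
        Z1, Z2 = X1 @ W.T, X2 @ W.T                # project inputs into R^K
        sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2 * Z1 @ Z2.T
        return sf2 * np.exp(-0.5 * np.maximum(sq, 0.0))

    def ard_se_kernel(X1, X2, sf2, w):
        # Eq. (2): the special case where W = diag(w_1, ..., w_D) is square and diagonal
        return mahalanobis_se_kernel(X1, X2, sf2, np.diag(w))

    rng = np.random.default_rng(1)
    X = rng.standard_normal((4, 6))
    W = rng.standard_normal((2, 6))                # K = 2 latent dimensions, D = 6 inputs
    print(mahalanobis_se_kernel(X, X, 1.0, W))     # a 4 x 4 kernel matrix
    print(np.allclose(ard_se_kernel(X, X, 1.0, np.ones(6)),
                      mahalanobis_se_kernel(X, X, 1.0, np.eye(6))))   # True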

22 , 2003) that allows for supervised dimensionality reduction to be applied in a GP regression setting (Vivarelli and Williams, 1998). [sent-42, score-0.137]

23 In a full Bayesian formulation, the hyperparameters θ = (σf , W) are assigned a prior distribution p(θ) and the Bayesian model follows the hierarchical structure depicted in Figure 1(a). [sent-43, score-0.29]

24 According to this structure the random function f (x) and the hyperparameters θ are a priori coupled since the former quantity is generated conditional on the latter. [sent-44, score-0.24]

25 This can make approximate, and in particular variational, inference over the hyperparameters troublesome. [sent-45, score-0.218]

26 To clarify this, observe that the joint density induced by the finite data is $p(y, f, \theta) = \mathcal{N}(y|f, \sigma^2 I)\,\mathcal{N}(f|0, K_{f,f})\,p(\theta)$, (3) where the vector f stores the latent function values at inputs X and $K_{f,f}$ is the n × n kernel matrix obtained by evaluating the kernel function on those inputs. [sent-46, score-0.411]

27 Clearly, in the term $\mathcal{N}(f|0, K_{f,f})$ the hyperparameters $\theta$ appear non-linearly inside the inverse and determinant of the kernel matrix $K_{f,f}$. [sent-47, score-0.361]

28 This is because the augmentation with auxiliary variables used in (Titsias and Lawrence, 2010), which allows one to bypass the intractable term $\mathcal{N}(f|0, K_{f,f})$, leads to an inversion of a matrix $K_{u,u}$ that still depends on the kernel hyperparameters. [sent-49, score-0.198]

29 More precisely, the Ku,u matrix is defined on auxiliary values u comprising points of the function f (x) at some arbitrary and freely optimisable inputs (Snelson and Ghahramani, 2006a; Titsias, 2009). [sent-50, score-0.211]

30 While this kernel matrix does not depend on the inputs X any more (which need to be integrated out in the GP-LVM case), it still depends on $\theta$, making a possible variational treatment of those parameters intractable. [sent-51, score-0.434]

31 Such an approach makes use of the so-called standardised representation of the GP model that is described next. [sent-54, score-0.276]

32 The above GP is referred to as a standardised process, whereas a sample path s(z) is referred to as a standardised function. [sent-57, score-0.588]

33 The interesting property of a standardised process is that it does not depend on kernel hyperparameters, since it is defined in a space where all hyperparameters have been neutralised to take the value one. [sent-58, score-0.79]

34 The above simply says that the value of f (x) at a certain input x is the value of the standardised function s(z), for z = Wx ∈ RK , times a global scale σf that changes the amplitude or power of the new function. [sent-60, score-0.28]

35 Given $(\sigma_f, W)$, the above assumptions induce a GP prior on the function $f(x)$, which has zero mean and the following kernel function: $k_f(x, x') = \mathbb{E}[\sigma_f s(Wx)\,\sigma_f s(Wx')] = \sigma_f^2 e^{-\frac{1}{2} d_W^2(x, x')}$, (6) which is precisely the kernel function given in eq. (1). [sent-61, score-0.458]

36 Nevertheless, the representation using the standardised process also implies a reparametrisation of the GP regression model where a priori the hyperparameters θ and the GP function are independent. [sent-64, score-0.568]

37 More precisely, one can now represent the GP model according to the following structure: $s(z) \sim \mathcal{GP}(0, k_s(z, z'))$, $\theta \sim p(\theta)$, $f(x) = \sigma_f s(Wx)$, $y_i \sim \mathcal{N}(y_i | f(x_i), \sigma^2)$, $i = 1, \ldots, n$. [sent-65, score-0.078]
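The following sketch samples data from this reparametrised structure (an illustration under the definitions above, not the paper's code): the standardised function is drawn independently of the hyperparameters, and only the data-generation stage combines the two.

    import numpy as np

    def ks(Z1, Z2):
        # standardised kernel k_s(z, z') = exp(-0.5 * ||z - z'||^2): no free hyperparameters
        sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2 * Z1 @ Z2.T
        return np.exp(-0.5 * np.maximum(sq, 0.0))

    rng = np.random.default_rng(2)
    n, D, K = 50, 10, 3
    sigma_f, sigma = 2.0, 0.1                      # theta = (sigma_f, W) set separately from s
    W = rng.standard_normal((K, D))

    X = rng.standard_normal((n, D))
    Z = X @ W.T                                    # z_i = W x_i in R^K
    Ks = ks(Z, Z) + 1e-8 * np.eye(n)               # small jitter for numerical stability
    s = rng.multivariate_normal(np.zeros(n), Ks)   # s(z) ~ GP(0, k_s) at the projected inputs
    f = sigma_f * s                                # f(x_i) = sigma_f * s(W x_i)
    y = f + sigma * rng.standard_normal(n)         # y_i ~ N(f(x_i), sigma^2)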

38 The interesting property of this representation is that the GP function s(z) and the hyperparameters $\theta$ interact only inside the likelihood function, while a priori they are independent. [sent-69, score-0.262]

39 Next we discuss the details of this variational method. [sent-71, score-0.216]

40 2.3 Variational inference using auxiliary variables. We define a set of m auxiliary variables $u \in \mathbb{R}^m$ such that each $u_i$ is a value of the standardised function, so that $u_i = s(z_i)$, and each input $z_i \in \mathbb{R}^K$ lives in dimension K. [sent-73, score-0.376]

41 The points $(z_1, \ldots, z_m)$ are referred to as inducing inputs and consist of freely optimisable parameters that can improve the accuracy of the approximation. [sent-77, score-0.152]

42 The inducing variables u follow the Gaussian density $p(u) = \mathcal{N}(u|0, K_{u,u})$, (8) where $[K_{u,u}]_{ij} = k_s(z_i, z_j)$ and $k_s$ is the standardised kernel function given by eq. [sent-78, score-0.593]

43 Notice that the density p(u) does not depend on the kernel hyperparameters and, in particular, does not depend on the matrix W. [sent-80, score-0.368]

44 This is a rather critical point that essentially makes the variational method applicable, and it comprises the novelty of our method compared to the initial framework in (Titsias and Lawrence, 2010). [sent-81, score-0.216]

45 The vector f of noise-free latent function values, such that $[f]_i = \sigma_f s(Wx_i)$, covaries with the vector u based on the cross-covariance function $k_{f,u}(x, z) = \mathbb{E}[\sigma_f s(Wx) s(z)] = \sigma_f \mathbb{E}[s(Wx) s(z)] = \sigma_f e^{-\frac{1}{2}\|Wx - z\|^2} = \sigma_f k_s(Wx, z)$. [sent-82, score-0.126]
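A short sketch of the two covariance blocks used here (illustrative only; variable names are assumptions): $K_{u,u}$ is built from the standardised kernel alone and therefore involves neither $\sigma_f$ nor W, whereas the cross-covariance $K_{f,u}$ couples f and u through both.

    import numpy as np

    def ks(Z1, Z2):
        # standardised kernel: k_s(z, z') = exp(-0.5 * ||z - z'||^2)
        sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2 * Z1 @ Z2.T
        return np.exp(-0.5 * np.maximum(sq, 0.0))

    rng = np.random.default_rng(3)
    n, D, K, m = 30, 8, 2, 5
    X = rng.standard_normal((n, D))
    Z_ind = rng.standard_normal((m, K))            # inducing inputs z_1..z_m live directly in R^K
    sigma_f, W = 1.5, rng.standard_normal((K, D))

    K_uu = ks(Z_ind, Z_ind)                        # eq. (8): depends only on Z, not on (sigma_f, W)
    K_fu = sigma_f * ks(X @ W.T, Z_ind)            # k_{f,u}(x, z) = sigma_f * k_s(W x, z)
    print(K_uu.shape, K_fu.shape)                  # (5, 5) (30, 5)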

46 The panel in (b) shows an equivalent representation of the GP model, expressed through the standardised random function s(z), which does not depend on hyperparameters and interacts with the hyperparameters in the data generation process. [sent-89, score-0.495]

47 The panel in (c) shows how the latent dimensionality of the Puma dataset is inferred to be 4, roughly corresponding to input dimensions 4, 5, 15 and 16 (see Section 3. [sent-91, score-0.223]

48 (3), after a marginalisation over the inducing variables, i. [sent-94, score-0.073]

49 We would now like to apply variational inference in the augmented joint model $p(y, f, u, W) = \mathcal{N}(y|f, \sigma^2 I)\,p(f|u, W)\,p(u)\,p(W)$, in order to approximate the intractable posterior distribution $p(f, W, u|y)$. [sent-97, score-0.259]

50 We introduce the variational distribution $q(f, W, u) = p(f|u, W)\,q(W)\,q(u)$, (10) where $p(f|u, W)$ is the conditional GP prior that appears in the joint model and $q(u)$ is a free-form variational distribution that after optimisation is found to be Gaussian (see Section B.1 of the supplementary material). [sent-98, score-0.533]

51 The KL divergence term $\mathrm{KL}(q(W)\,\|\,p(W))$, which depends on the prior distribution over W, is described in the next section. [sent-100, score-0.124]

52 The variational lower bound is maximised using gradient-based methods over the variational parameters $\{\mu_{kd}, \sigma_{kd}^2\}_{k=1,d=1}^{K,D}$, the inducing inputs Z (which are also variational parameters), and the hyperparameters $(\sigma_f, \sigma^2)$. [sent-101, score-0.968]

53 Notice that treating $(\sigma_f, \sigma^2)$ in a Bayesian manner is easier, and approximate inference could be done with the standard conjugate variational Bayesian framework (Bishop, 2006). [sent-103, score-0.263]

54 Learning the set of variances $\{\ell_k^2\}_{k=1}^{K}$ can allow the model to automatically select the dimensionality associated with the Mahalanobis distance metric $d_W(x, x')$. [sent-107, score-0.109]

55 This could be carried out either by applying a Type II ML estimation procedure or by a variational Bayesian approach, where the latter assigns a conjugate Gamma prior on the variances and optimises a variational distribution $q(\{\ell_k^2\}_{k=1}^{K})$ over them. [sent-108, score-0.515]

56 The optimisable quantities in both these procedures can be removed analytically and optimally from the variational lower bound as described next. [sent-109, score-0.319]

57 (12), which can be written in the form $\mathrm{KL} = \frac{1}{2}\sum_{k=1}^{K}\left[\frac{\sum_{d=1}^{D}(\sigma_{dk}^2 + \mu_{dk}^2)}{\ell_k^2} - D - \sum_{d=1}^{D}\log\frac{\sigma_{dk}^2}{\ell_k^2}\right]$. [sent-112, score-0.09]

58 By first minimizing this term with respect to these hyperparameters we find that $\ell_k^2 = \frac{1}{D}\sum_{d=1}^{D}(\sigma_{dk}^2 + \mu_{dk}^2)$, $k = 1, \ldots, K$, (14) [sent-113, score-0.286]

59 and then by substituting back these optimal values into the KL divergence we obtain $\mathrm{KL} = \frac{1}{2}\sum_{k=1}^{K}\left[D\log\!\left(\sum_{d=1}^{D}(\sigma_{dk}^2 + \mu_{dk}^2)\right) - \sum_{d=1}^{D}\log\sigma_{dk}^2 - D\log D\right]$, (15) which now depends only on variational parameters. [sent-116, score-0.33]
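The analytic removal above can be checked numerically; the sketch below is an illustration that follows eqs. (13)-(15) as written here, with $\ell_k^2$ the prior variances, and is not the authors' code.

    import numpy as np

    rng = np.random.default_rng(4)
    K, D = 3, 7
    mu = rng.standard_normal((D, K))               # variational means mu_dk
    s2 = rng.uniform(0.1, 2.0, size=(D, K))        # variational variances sigma_dk^2
    S = np.sum(s2 + mu**2, axis=0)                 # S_k = sum_d (sigma_dk^2 + mu_dk^2)

    def kl(l2):
        # eq. (13): KL(q(W) || p(W)) with q(w_dk) = N(mu_dk, s2_dk) and p(w_dk) = N(0, l2_k)
        return 0.5 * np.sum(S / l2 - D - np.sum(np.log(s2 / l2[None, :]), axis=0))

    l2_opt = S / D                                 # eq. (14): closed-form optimum of the variances
    kl_collapsed = 0.5 * np.sum(D * np.log(S) - np.sum(np.log(s2), axis=0) - D * np.log(D))  # eq. (15)

    print(np.isclose(kl(l2_opt), kl_collapsed))    # True: substitution reproduces the collapsed form
    print(kl(l2_opt) <= kl(1.5 * l2_opt))          # True: the closed form is indeed a minimiser
    # ARD-style behaviour: a small l2_opt[k] indicates that latent dimension k is being switched off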

60 Then, by following a similar procedure to the one above, we can optimally remove the variational factor $q(\{\ell_k^2\}_{k=1}^{K})$ (see Section B.2 in the supplementary material). [sent-118, score-0.216]

61 This yields $\mathrm{KL} = \left(\frac{D}{2} + \alpha\right)\sum_{k=1}^{K}\log\!\left(2\beta + \sum_{d=1}^{D}(\mu_{kd}^2 + \sigma_{kd}^2)\right) - \frac{1}{2}\sum_{k=1}^{K}\sum_{d=1}^{D}\log\sigma_{kd}^2 + \text{const}$, (16) which, as expected, has the nice property that when α = β = 0, so that the prior over variances becomes improper, it reduces to the quantity in (15). [sent-119, score-0.18]

62 Finally, it is important to notice that different, and particularly non-Gaussian, priors for the parameters W can also be accommodated by our variational method. [sent-120, score-0.239]

63 More precisely, any alternative prior for W changes only the form of the negative KL divergence term in the lower bound in eq. [sent-121, score-0.076]

64 In the experiments we have used the ARD prior described above, while the investigation of alternative priors is left as future work. [sent-124, score-0.075]

65 Furthermore, although the predictive density is not Gaussian, its mean and variance can be computed analytically as explained in Section B. [sent-132, score-0.055]

66 We will use as benchmarks a full GP with automatic relevance determination (ARD) and the state-of-the-art SPGP-DR model, which is described below. [sent-135, score-0.119]

67 Also, see Section A of the supplementary material for an example of dimensionality reduction on a simple toy example. [sent-136, score-0.135]

68 This model is sometimes referred to as FITC (fully independent training conditional) and uses an active set of m pseudo-inputs that control the speed vs. accuracy trade-off. [sent-139, score-0.063]

69 SPGP is often used when dealing with datasets containing more than a few thousand samples, since in those cases the cost of a full GP becomes impractical. [sent-141, score-0.073]

70 In Snelson and Ghahramani (2006b), a version of SPGP with dimensionality reduction (SPGP-DR) is presented. [sent-142, score-0.086]

71 First, SPGP’s pseudo-inputs are model parameters and, as such, fitting a large number of them can result in overfitting, whereas the inducing inputs used in VDMGP are variational parameters whose optimisation can only result in a better fit of the posterior densities. [sent-148, score-0.411]

72 Second, SPGP-DR does not place a prior on the linear projection matrix W; it is instead fitted using Maximum Likelihood, just as the pseudo-inputs. [sent-149, score-0.106]

73 In contrast, VDMGP does place a prior on W and variationally integrates it out. [sent-150, score-0.106]

74 These differences yield an important consequence: VDMGP can infer automatically the latent dimensionality K of data, but SPGP-DR is unable to, since increasing K is never going to decrease its likelihood. [sent-151, score-0.121]

75 Thus, VDMGP follows Occam’s razor on the number of latent dimensions K. [sent-152, score-0.121]

76 We ran SPGP-DR and VDMGP using the exact same initialisation for the projection matrix on both algorithms and tested the effect of using a reduced number of training points. [sent-232, score-0.089]

77 For SPGP-DR we tested several possible latent dimensions K = {2, 5, 10, 15, 20, 30}, whereas for VDMGP we fixed K = 20 and let the model infer the number of dimensions. [sent-233, score-0.145]

78 The number of inducing variables (pseudo-inputs for SPGP-DR) was set to 10 for Temp and 20 for SO2 . [sent-234, score-0.073]

79 The multiple dotted blue lines correspond to SPGP-DR with different choices of latent dimensionality K. [sent-238, score-0.121]

80 The dashed black line represents the full GP, which has been run for training sets up to size 2000. [sent-239, score-0.077]

81 When feasible, the full GP performs best, but since it requires the inversion of the full kernel matrix, it cannot be applied to large-scale problems such as the ones considered in this subsection. [sent-242, score-0.223]

82 Also, even in reasonably-sized problems, the full GP may run into trouble if several noise-only input dimensions are present. [sent-243, score-0.151]

83 SPGP-DR works well for large training set sizes, since there is enough information for it to avoid overfitting and the advantage of using a prior on W is reduced. [sent-244, score-0.087]

84 uk/~gcc/competition/ Temp: 106 dimensions, 7117/3558 training/testing data; SO2: 27 dimensions, 15304/7652 training/testing data. [sent-249, score-0.106]

85 Finally, VDMGP results in scalable performance: It is able to perform dimensionality reduction and achieve high accuracy both on small and large datasets, while still being faster than a full GP. [sent-253, score-0.128]

86 Labels represent angular accelerations of one of the robot arm’s links, which have to be predicted based on the angular positions, velocities and torques of the robot arm. [sent-257, score-0.118]

87 It is well-known from previous works (Snelson and Ghahramani, 2006a) that only 4 out of the 32 input dimensions are relevant for the prediction task, and that identifying them is not always easy. [sent-259, score-0.079]

88 In particular, SPGP (the standard version, with no dimensionality reduction), fails at this task unless initialised from a “good guess” about the relevant dimensions coming from a different model, as discussed in (Snelson and Ghahramani, 2006a). [sent-260, score-0.106]

89 We thought it would be interesting to assess the performance of the discussed models on this dataset, again considering different training set sizes, which are generated by randomly sampling from the training set. [sent-261, score-0.091]

90 VDMGPR determines that there are 4 latent dimensions, as shown in Figure 1(c). [sent-263, score-0.068]

91 However, since the computation of the variational bound of VDMGP involves more steps than the computation of the evidence of SPGPDR, VDMGP is slower than SPGP-DR. [sent-267, score-0.216]

92 In two typical cases, using 500 and 5000 training points, the full GP runs in 0.24 seconds (for 500 training points) and in 34 seconds (for 5000 training points). [sent-268, score-0.077]

93 VDMGP runs in 0. [sent-269, score-0.11]

94 4 Discussion and further work. A typical approach to regression when the number of input dimensions is large is to first use a linear projection of the input data to reduce dimensionality (e.g. [sent-274, score-0.241]

95 Instead of approaching the problem in two separate steps, a monolithic approach allows the dimensionality reduction to be tailored to the specific regression problem. [sent-277, score-0.137]
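For contrast, a minimal version of the two-step baseline referred to here can be written with scikit-learn; this is a generic sketch of that strategy, not the paper's experimental setup, and the data and model choices are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(5)
    X = rng.standard_normal((300, 30))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(300)   # few relevant dimensions

    # Step 1: unsupervised linear projection; Step 2: standard GP regression on the projected inputs.
    # The projection is chosen without looking at the targets, which is what a monolithic
    # approach such as the one discussed here avoids.
    model = make_pipeline(
        PCA(n_components=5),
        GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1),
                                 normalize_y=True),
    )
    model.fit(X[:200], y[:200])
    print("held-out R^2:", model.score(X[200:], y[200:]))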

96 In this work we have shown that it is possible to variationally integrate out the linear projection of the inputs of a GP, which, as a particular case, corresponds to integrating out its length-scale hyperparameters. [sent-278, score-0.117]

97 Only two parameters (noise variance and scale) are free in this model, whereas the remaining parameters appearing in the bound are free variational parameters, and optimizing them can only result in improved posterior estimates. [sent-280, score-0.261]

98 This allows us to automatically infer the number of latent dimensions that are needed for regression in a given problem, which is also not possible using SPGP-DR. [sent-281, score-0.172]

99 Variable noise and dimensionality reduction for sparse Gaussian processes. [sent-343, score-0.086]

100 Variational learning of inducing variables in sparse Gaussian processes. [sent-353, score-0.073]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gp', 0.495), ('vdmgp', 0.41), ('standardised', 0.254), ('spgp', 0.224), ('variational', 0.216), ('wx', 0.198), ('hyperparameters', 0.196), ('titsias', 0.159), ('temp', 0.156), ('kf', 0.146), ('nlpd', 0.137), ('snelson', 0.134), ('kernel', 0.12), ('puma', 0.117), ('nmse', 0.104), ('ard', 0.098), ('mahalanobis', 0.091), ('dk', 0.09), ('optimisable', 0.078), ('vivarelli', 0.078), ('dw', 0.076), ('ghahramani', 0.074), ('kd', 0.073), ('inducing', 0.073), ('kl', 0.07), ('latent', 0.068), ('lawrence', 0.067), ('dr', 0.065), ('williams', 0.06), ('ks', 0.058), ('dimensionality', 0.053), ('dimensions', 0.053), ('prior', 0.052), ('bayesian', 0.051), ('regression', 0.051), ('inputs', 0.051), ('gaussian', 0.049), ('bishop', 0.047), ('relevance', 0.044), ('full', 0.042), ('vdmgpr', 0.039), ('wkd', 0.039), ('auxiliary', 0.037), ('training', 0.035), ('df', 0.035), ('variationally', 0.034), ('saul', 0.034), ('reduction', 0.033), ('determination', 0.033), ('projection', 0.032), ('rk', 0.032), ('variances', 0.031), ('datasets', 0.031), ('density', 0.03), ('rasmussen', 0.03), ('angular', 0.03), ('trouble', 0.03), ('robot', 0.029), ('discusses', 0.029), ('mackay', 0.028), ('referred', 0.028), ('weinberger', 0.027), ('input', 0.026), ('tting', 0.026), ('optimisation', 0.026), ('miguel', 0.026), ('tipping', 0.026), ('squared', 0.026), ('material', 0.025), ('treatment', 0.025), ('analytical', 0.025), ('analytically', 0.025), ('distance', 0.025), ('supplementary', 0.024), ('divergence', 0.024), ('process', 0.024), ('whereas', 0.024), ('freely', 0.023), ('priors', 0.023), ('inside', 0.023), ('panel', 0.023), ('xing', 0.023), ('competition', 0.023), ('conditional', 0.023), ('matrix', 0.022), ('figures', 0.022), ('murray', 0.022), ('inference', 0.022), ('metrics', 0.022), ('representation', 0.022), ('assess', 0.021), ('posterior', 0.021), ('priori', 0.021), ('precisely', 0.02), ('neal', 0.02), ('integrates', 0.02), ('acknowledges', 0.02), ('yi', 0.02), ('seconds', 0.02), ('inversion', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

Author: Michalis Titsias, Miguel Lazaro-Gredilla

Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1

2 0.29597008 54 nips-2013-Bayesian optimization explains human active search

Author: Ali Borji, Laurent Itti

Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1

3 0.24120674 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression

Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann

Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-ofart performance in discrete cases and competitive results in the continuous case. 1

4 0.2233869 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen

Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1

5 0.19699278 201 nips-2013-Multi-Task Bayesian Optimization

Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams

Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1

6 0.16828489 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

7 0.10848574 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

8 0.097847797 143 nips-2013-Integrated Non-Factorized Variational Inference

9 0.088164695 137 nips-2013-High-Dimensional Gaussian Process Bandits

10 0.083531946 75 nips-2013-Convex Two-Layer Modeling

11 0.081837729 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

12 0.077428751 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

13 0.077285334 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

14 0.07549569 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

15 0.066700481 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

16 0.061230268 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

17 0.05963219 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

18 0.058084473 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

19 0.05689846 224 nips-2013-On the Sample Complexity of Subspace Learning

20 0.056305856 317 nips-2013-Streaming Variational Bayes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.171), (1, 0.057), (2, -0.027), (3, -0.009), (4, -0.027), (5, 0.092), (6, 0.132), (7, 0.08), (8, 0.078), (9, -0.088), (10, -0.134), (11, -0.039), (12, -0.21), (13, 0.024), (14, 0.013), (15, 0.017), (16, -0.129), (17, 0.088), (18, -0.038), (19, -0.286), (20, 0.109), (21, -0.12), (22, -0.156), (23, 0.118), (24, 0.001), (25, -0.019), (26, -0.044), (27, -0.028), (28, 0.019), (29, 0.133), (30, -0.232), (31, 0.043), (32, 0.003), (33, 0.05), (34, 0.13), (35, 0.104), (36, 0.059), (37, -0.06), (38, -0.036), (39, -0.062), (40, 0.015), (41, -0.094), (42, 0.019), (43, -0.016), (44, 0.004), (45, -0.06), (46, 0.032), (47, 0.072), (48, -0.022), (49, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93183279 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

Author: Michalis Titsias, Miguel Lazaro-Gredilla

Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1

2 0.83838964 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression

Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann

Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-ofart performance in discrete cases and competitive results in the continuous case. 1

3 0.80180329 54 nips-2013-Bayesian optimization explains human active search

Author: Ali Borji, Laurent Itti

Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1

4 0.72262871 201 nips-2013-Multi-Task Bayesian Optimization

Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams

Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1

5 0.56482184 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

Author: Andreas Ruttor, Philipp Batz, Manfred Opper

Abstract: We introduce a nonparametric approach for estimating drift functions in systems of stochastic differential equations from sparse observations of the state vector. Using a Gaussian process prior over the drift as a function of the state vector, we develop an approximate EM algorithm to deal with the unobserved, latent dynamics between observations. The posterior over states is approximated by a piecewise linearized process of the Ornstein-Uhlenbeck type and the MAP estimation of the drift is facilitated by a sparse Gaussian process regression. 1

6 0.55674607 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

7 0.47179967 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

8 0.40531942 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

9 0.37922639 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

10 0.3649421 86 nips-2013-Demixing odors - fast inference in olfaction

11 0.35291141 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

12 0.35020515 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

13 0.34465608 137 nips-2013-High-Dimensional Gaussian Process Bandits

14 0.33936396 143 nips-2013-Integrated Non-Factorized Variational Inference

15 0.33817708 317 nips-2013-Streaming Variational Bayes

16 0.33772168 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

17 0.32579491 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

18 0.3198469 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations

19 0.31474483 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

20 0.30648217 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.012), (16, 0.033), (33, 0.115), (34, 0.227), (41, 0.021), (49, 0.039), (56, 0.093), (70, 0.015), (74, 0.243), (85, 0.04), (89, 0.03), (93, 0.042), (95, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8458935 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering

Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi

Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VBLRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency, we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach. 1

same-paper 2 0.82234716 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

Author: Michalis Titsias, Miguel Lazaro-Gredilla

Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1

3 0.79773676 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data

Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko

Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1

4 0.79561186 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis

Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1

5 0.78291529 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

Author: Ping Li, Gennady Samorodnitsk, John Hopcroft

Abstract: The method of stable random projections is useful for efficiently approximating the lα distance (0 < α ≤ 2) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when α = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ2 ) similarity. In text and vision applications, the χ2 similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009. There are many future research problems. For example, when α → 0, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams. 1

6 0.75764722 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

7 0.74829203 210 nips-2013-Noise-Enhanced Associative Memories

8 0.7465201 122 nips-2013-First-order Decomposition Trees

9 0.74603677 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

10 0.74595469 202 nips-2013-Multiclass Total Variation Clustering

11 0.7450732 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

12 0.74437505 143 nips-2013-Integrated Non-Factorized Variational Inference

13 0.73932922 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory

14 0.73905301 129 nips-2013-Generalized Random Utility Models with Multiple Types

15 0.73754126 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

16 0.73539209 256 nips-2013-Probabilistic Principal Geodesic Analysis

17 0.71514648 347 nips-2013-Variational Planning for Graph-based MDPs

18 0.71415305 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

19 0.70369524 86 nips-2013-Demixing odors - fast inference in olfaction

20 0.70262176 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning