nips nips2013 nips2013-346 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows us to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework recently developed for the Gaussian process latent variable model, and it additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods on datasets with high-dimensional inputs.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We introduce a novel variational method that allows us to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. [sent-6, score-0.36]
2 This approach consists of a novel variant of the variational framework recently developed for the Gaussian process latent variable model, and it additionally makes use of a standardised representation of the Gaussian process. [sent-7, score-0.584]
3 We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. [sent-8, score-0.153]
4 In particular, the commonly used procedure is to find point estimates over the kernel hyperparameters by maximizing the marginal likelihood, which is the likelihood obtained once the latent variables associated with the GP function have been integrated out (Rasmussen and Williams, 2006). [sent-11, score-0.384]
5 Such a procedure provides a practical algorithm that is expected to be robust to overfitting when the number of hyperparameters that need to be tuned is relatively small compared to the amount of data. [sent-12, score-0.196]
6 In contrast, when the number of hyperparameters is large this approach will suffer from the shortcomings of a typical maximum likelihood method such as overfitting. [sent-13, score-0.196]
7 To avoid these problems, it is common practice in GP models to use kernel functions with few hyperparameters, although this can limit flexibility when modelling the data. [sent-14, score-0.436]
8 On the other hand, while full Bayesian inference for GP models could be useful, in practice it is a very challenging task that has so far been attempted only with expensive MCMC techniques such as the recent method of Murray and Adams (2010). [sent-16, score-0.064]
9 Deterministic approximations, and in particular the variational Bayes framework, have not so far been applied to the treatment of kernel hyperparameters in GP models. [sent-17, score-0.557]
10 To this end, in this work we introduce a variational method for approximate Bayesian inference over hyperparameters in GP regression models with squared exponential kernel functions. [sent-18, score-0.631]
11 This approach consists of a novel variant of the variational framework introduced in (Titsias and Lawrence, 2010) for the Gaussian process latent variable model. [sent-19, score-0.308]
12 Furthermore, this method uses the concept of a standardised GP and allows for learning Mahalanobis distance metrics (Weinberger and Saul, 2009; Xing et al. [sent-20, score-0.325]
13 , 2003) in Gaussian process regression settings using Bayesian inference. [sent-21, score-0.075]
14 The remainder of this paper is organised as follows: Section 2 provides the motivation and theoretical foundation of the variational method, and Section 3 demonstrates the method on a number of challenging regression datasets, also providing a comprehensive comparison with existing methods. [sent-23, score-0.298]
15 Section 2.1 discusses Bayesian GP regression and motivates the variational method. [sent-26, score-0.296]
16 Section 2.2 explains the concept of the standardised representation of a GP model that is used by the variational method described in Section 2. [sent-28, score-0.492]
17 Section 2.4 discusses how to set the prior over the kernel hyperparameters, together with an analytical way to reduce the number of parameters that need to be optimised during variational inference. [sent-31, score-0.638]
18 2.1 Bayesian GP regression and motivation for the variational method. Suppose we have data {yi, xi}_{i=1}^n, where each xi ∈ R^D and each yi is a real-valued scalar output. [sent-35, score-0.287]
19 In GP regression, we assume that each observed output is generated according to yi = f(xi) + εi, εi ∼ N(0, σ²), where the latent function f(x) is assigned a zero-mean GP prior with a covariance or kernel function kf(x, x′) that depends on hyperparameters θ. [sent-37, score-0.644]
20 Throughout the paper we will consider the following squared exponential kernel function: kf(x, x′) = σf² exp(−½ (x − x′)ᵀ WᵀW (x − x′)) = σf² exp(−½ ||Wx − Wx′||²) = σf² exp(−½ dW²(x, x′)), (1) where dW(x, x′) = ||Wx − Wx′||. [sent-38, score-0.292]
21 In the special case where W is a square and diagonal matrix, the above kernel function reduces to kf(x, x′) = σf² exp(−½ Σ_{d=1}^D wd² (xd − x′d)²), (2) which is the well-known ARD squared exponential kernel commonly used in GP regression applications (Rasmussen and Williams, 2006). [sent-40, score-0.463]
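To make eqs. (1)-(2) concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code; all function and variable names are ours) of the Mahalanobis squared-exponential kernel and its diagonal ARD special case.

```python
import numpy as np

def mahalanobis_se_kernel(X1, X2, W, sigma_f):
    """Eq. (1): kf(x, x') = sigma_f^2 * exp(-0.5 * ||W x - W x'||^2)."""
    Z1, Z2 = X1 @ W.T, X2 @ W.T  # project the inputs: z = W x, with z in R^K
    sq_dist = (np.sum(Z1**2, axis=1)[:, None]
               + np.sum(Z2**2, axis=1)[None, :]
               - 2.0 * Z1 @ Z2.T)
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq_dist, 0.0))

def ard_se_kernel(X1, X2, w, sigma_f):
    """Eq. (2): the special case with W = diag(w_1, ..., w_D)."""
    return mahalanobis_se_kernel(X1, X2, np.diag(w), sigma_f)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))   # n = 5 inputs with D = 10 dimensions
W = rng.standard_normal((3, 10))   # K = 3 rows: a low-rank Mahalanobis metric
Kff = mahalanobis_se_kernel(X, X, W, sigma_f=1.0)
print(Kff.shape, np.allclose(Kff, Kff.T))  # (5, 5) True
```

Note that the squared distances are computed in the projected K-dimensional space, so when K is much smaller than D the cost of evaluating the kernel is dominated by the projection itself.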
22 , 2003) that allows for supervised dimensionality reduction to be applied in a GP regression setting (Vivarelli and Williams, 1998). [sent-42, score-0.137]
23 In a full Bayesian formulation, the hyperparameters θ = (σf , W) are assigned a prior distribution p(θ) and the Bayesian model follows the hierarchical structure depicted in Figure 1(a). [sent-43, score-0.29]
24 According to this structure the random function f (x) and the hyperparameters θ are a priori coupled since the former quantity is generated conditional on the latter. [sent-44, score-0.24]
25 This can make approximate, and in particular variational, inference over the hyperparameters troublesome. [sent-45, score-0.218]
26 To clarify this, observe that the joint density induced by the finite data is p(y, f, θ) = N(y | f, σ²I) N(f | 0, Kf,f) p(θ), (3) where the vector f stores the latent function values at inputs X and Kf,f is the n × n kernel matrix obtained by evaluating the kernel function on those inputs. [sent-46, score-0.411]
27 Clearly, in the term N(f | 0, Kf,f) the hyperparameters θ appear non-linearly inside the inverse and determinant of the kernel matrix Kf,f. [sent-47, score-0.361]
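As an illustration of where θ = (σf, W) enters, here is a hedged sketch of the GP log marginal likelihood log N(y | 0, Kf,f + σ²I), reusing mahalanobis_se_kernel from the sketch above; the hyperparameters sit inside the Cholesky factorisation that provides the inverse and log-determinant of the kernel matrix.

```python
import numpy as np

def gp_log_marginal_likelihood(y, X, W, sigma_f, sigma_n):
    """log N(y | 0, Kff + sigma_n^2 I); (sigma_f, W) enter non-linearly through
    the inverse and determinant of the kernel matrix, as discussed above."""
    n = y.shape[0]
    K = mahalanobis_se_kernel(X, X, W, sigma_f) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                         # data-fit term
            - np.sum(np.log(np.diag(L)))             # -0.5 * log|K|
            - 0.5 * n * np.log(2.0 * np.pi))
```

Point estimation of θ maximises this quantity, which is exactly the procedure whose overfitting risk motivates the variational treatment discussed in the introduction.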
28 This is because the augmentation with auxiliary variables used in (Titsias and Lawrence, 2010), which makes it possible to bypass the intractable term N(f | 0, Kf,f), leads to an inversion of a matrix Ku,u that still depends on the kernel hyperparameters. [sent-49, score-0.198]
29 More precisely, the Ku,u matrix is defined on auxiliary values u comprising points of the function f (x) at some arbitrary and freely optimisable inputs (Snelson and Ghahramani, 2006a; Titsias, 2009). [sent-50, score-0.211]
30 While this kernel matrix does not depend on the inputs X any more (which need to be integrated out in the GP-LVM case), it still depends on θ, making a possible variational treatment of those parameters intractable. [sent-51, score-0.434]
31 Such an approach makes use of the so-called standardised representation of the GP model, which is described next. [sent-54, score-0.276]
32 The above GP is referred to as a standardised process, whereas a sample path s(z) is referred to as a standardised function. [sent-57, score-0.588]
33 An interesting property of a standardised process is that it does not depend on kernel hyperparameters, since it is defined in a space where all hyperparameters have been neutralised to take the value one. [sent-58, score-0.79]
34 The above simply says that the value of f (x) at a certain input x is the value of the standardised function s(z), for z = Wx ∈ RK , times a global scale σf that changes the amplitude or power of the new function. [sent-60, score-0.28]
35 Given (σf, W), the above assumptions induce a GP prior on the function f(x), which has zero mean and the following kernel function: kf(x, x′) = E[σf s(Wx) σf s(Wx′)] = σf² exp(−½ dW²(x, x′)), (6) which is precisely the kernel function given in eq. (1). [sent-61, score-0.458]
36 Nevertheless, the representation using the standardised process also implies a reparametrisation of the GP regression model where a priori the hyperparameters θ and the GP function are independent. [sent-64, score-0.568]
37 More precisely, one can now represent the GP model according to the following structure: s(z) ∼ GP(0, ks(z, z′)), θ ∼ p(θ), f(x) = σf s(Wx), yi ∼ N(yi | f(xi), σ²), i = 1, …, n. [sent-65, score-0.078]
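This reparametrised structure can be simulated directly. Below is a hedged sketch (our illustration, reusing mahalanobis_se_kernel from the earlier sketch) that samples data by first drawing the standardised function at the projected inputs zi = Wxi and then scaling by σf; the assert checks the identity of eq. (6), namely that σf² ks(Wx, Wx′) equals kf(x, x′).

```python
import numpy as np

def standardised_kernel(Z1, Z2):
    """ks(z, z') = exp(-0.5 * ||z - z'||^2): no kernel hyperparameters at all."""
    sq = (np.sum(Z1**2, axis=1)[:, None]
          + np.sum(Z2**2, axis=1)[None, :] - 2.0 * Z1 @ Z2.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0))

def sample_reparametrised_gp(X, W, sigma_f, sigma_n, rng):
    """s ~ GP(0, ks), f(x) = sigma_f * s(Wx), y_i ~ N(f(x_i), sigma_n^2)."""
    Z = X @ W.T                                                  # projected inputs z_i = W x_i
    Ks = standardised_kernel(Z, Z) + 1e-8 * np.eye(X.shape[0])   # small jitter for stability
    s = rng.multivariate_normal(np.zeros(X.shape[0]), Ks)        # standardised function values
    f = sigma_f * s
    return f + sigma_n * rng.standard_normal(X.shape[0])

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 8))
W = rng.standard_normal((2, 8))
# Identity of eq. (6): the induced covariance of f matches the kernel of eq. (1).
assert np.allclose(2.0**2 * standardised_kernel(X @ W.T, X @ W.T),
                   mahalanobis_se_kernel(X, X, W, sigma_f=2.0))
y = sample_reparametrised_gp(X, W, sigma_f=2.0, sigma_n=0.1, rng=rng)
```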
38 The interesting property of this representation is that the GP function s(z) and the hyperparameters θ interact only inside the likelihood function, while a priori they are independent. [sent-69, score-0.262]
39 Next we discuss the details of this variational method. [sent-71, score-0.216]
40 2.3 Variational inference using auxiliary variables. We define a set of m auxiliary variables u ∈ R^m such that each ui is a value of the standardised function, i.e. ui = s(zi), and the input zi ∈ R^K lives in dimension K. [sent-73, score-0.376]
41 The inputs Z = (z1, …, zm) are referred to as inducing inputs and are freely optimisable parameters that can improve the accuracy of the approximation. [sent-77, score-0.152]
42 The inducing variables u follow the Gaussian density p(u) = N(u | 0, Ku,u), (8) where [Ku,u]_{ij} = ks(zi, zj) and ks(z, z′) = exp(−½ ||z − z′||²) is the standardised kernel function. [sent-78, score-0.593]
43 Notice that the density p(u) does not depend on the kernel hyperparameters and, in particular, not on the matrix W. [sent-80, score-0.368]
44 This is a rather critical point: it is essentially what makes the variational method applicable, and it comprises the novelty of our method compared to the initial framework in (Titsias and Lawrence, 2010). [sent-81, score-0.216]
45 The vector f of noise-free latent function values, with [f]_i = σf s(Wxi), covaries with the vector u through the cross-covariance function kf,u(x, z) = E[σf s(Wx) s(z)] = σf E[s(Wx) s(z)] = σf exp(−½ ||Wx − z||²) = σf ks(Wx, z). [sent-82, score-0.126]
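A short sketch (again ours, reusing standardised_kernel from the sketch above) of the two covariance blocks used here: Ku,u of eq. (8), which involves only the inducing inputs Z, and the cross-covariance between f and u, which is where W and σf enter.

```python
import numpy as np

def inducing_covariances(X, Z_inducing, W, sigma_f):
    """Kuu (eq. 8) is hyperparameter-free; Kfu = sigma_f * ks(W x_i, z_j) involves W."""
    Kuu = standardised_kernel(Z_inducing, Z_inducing)         # m x m
    Kfu = sigma_f * standardised_kernel(X @ W.T, Z_inducing)  # n x m
    return Kuu, Kfu

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 8))          # n = 20 inputs, D = 8
W = rng.standard_normal((3, 8))           # K = 3 latent dimensions
Z_inducing = rng.standard_normal((5, 3))  # m = 5 inducing inputs living in R^K
Kuu, Kfu = inducing_covariances(X, Z_inducing, W, sigma_f=1.5)
print(Kuu.shape, Kfu.shape)               # (5, 5) (20, 5)
```

Because Kuu never touches (σf, W), it can be inverted inside the variational bound without obstructing the integration over W, which is the point made in the surrounding text.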
46 The panel in (b) shows an equivalent representation of the GP model expressed through the standardised random function s(z), which does not depend on the hyperparameters and interacts with them only in the data-generation process. [sent-89, score-0.495]
47 The panel in (c) shows how the latent dimensionality of the Puma dataset is inferred to be 4, roughly corresponding to input dimensions 4, 5, 15 and 16 (see Section 3. [sent-91, score-0.223]
48 (3), after a marginalisation over the inducing variables, i. [sent-94, score-0.073]
49 We now apply variational inference in the augmented joint model p(y, f, u, W) = N(y | f, σ²I) p(f | u, W) p(u) p(W) in order to approximate the intractable posterior distribution p(f, W, u | y). [sent-97, score-0.259]
50 We introduce the variational distribution q(f, W, u) = p(f | u, W) q(W) q(u), (10) where p(f | u, W) is the conditional GP prior that appears in the joint model, q(W) is a factorised Gaussian with means µkd and variances σ²kd, and q(u) is a free-form variational distribution that after optimisation is found to be Gaussian (see Section B.1 of the supplementary material). [sent-98, score-0.533]
51 The KL divergence term KL(q(W)||p(W)), which depends on the prior distribution over W, is described in the next section. [sent-100, score-0.124]
52 The variational lower bound is maximised using gradient-based methods over the variational parameters {µkd, σ²kd}_{k=1,d=1}^{K,D}, the inducing inputs Z (which are also variational parameters) and the hyperparameters (σf, σ²). [sent-101, score-0.968]
53 Notice that treating (σf, σ²) in a Bayesian manner is easier, and approximate inference could be done with the standard conjugate variational Bayesian framework (Bishop, 2006). [sent-103, score-0.263]
54 Learning the set of variances {ℓ²k}_{k=1}^K (one prior variance per row of W) makes it possible to automatically select the dimensionality associated with the Mahalanobis distance metric dW(x, x′). [sent-107, score-0.109]
55 This could be carried out either by applying a Type II ML estimation procedure or by a variational Bayesian approach, where the latter assigns a conjugate Gamma prior on the variances and optimises a variational distribution q({ℓ²k}_{k=1}^K) over them. [sent-108, score-0.515]
56 The optimisable quantities in both these procedures can be removed analytically and optimally from the variational lower bound as described next. [sent-109, score-0.319]
57 The KL divergence term KL(q(W)||p(W)) in eq. (12) can be written in the form KL = (1/2) Σ_{k=1}^K [ Σ_{d=1}^D (σ²kd + µ²kd)/ℓ²k − D − Σ_{d=1}^D log(σ²kd/ℓ²k) ]. (13) [sent-112, score-0.09]
58 By first minimizing this term with respect to the variances ℓ²k we find that ℓ²k = (1/D) Σ_{d=1}^D (σ²kd + µ²kd), k = 1, …, K, (14) [sent-113, score-0.286]
59 and then, by substituting these optimal values back into the KL divergence, we obtain KL = (1/2) Σ_{k=1}^K [ D log( Σ_{d=1}^D (σ²kd + µ²kd) ) − Σ_{d=1}^D log σ²kd − D log D ], (15) which now depends only on variational parameters. [sent-116, score-0.33]
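A small numerical check of the collapse in eqs. (13)-(15), under the reconstruction above (factorised Gaussian q(W) with means µkd and variances σ²kd, and a zero-mean Gaussian prior with a shared variance ℓ²k per row of W); the helper names are ours.

```python
import numpy as np

def kl_eq13(mu, s2, ell2):
    """Eq. (13): KL(q(W) || p(W)) with per-row prior variances ell2_k."""
    return 0.5 * np.sum((s2 + mu**2) / ell2[:, None] - 1.0
                        - np.log(s2) + np.log(ell2)[:, None])

def kl_eq15(mu, s2):
    """Eq. (15): eq. (13) with the optimal ell2_k of eq. (14) substituted back in."""
    D = mu.shape[1]
    row_sums = np.sum(s2 + mu**2, axis=1)
    return 0.5 * np.sum(D * np.log(row_sums)
                        - np.sum(np.log(s2), axis=1)
                        - D * np.log(D))

rng = np.random.default_rng(3)
mu = rng.standard_normal((3, 8))   # variational means mu_kd (K = 3, D = 8)
s2 = rng.random((3, 8)) + 0.1      # variational variances sigma^2_kd
ell2_opt = np.mean(s2 + mu**2, axis=1)                       # eq. (14)
assert np.isclose(kl_eq13(mu, s2, ell2_opt), kl_eq15(mu, s2))
assert kl_eq13(mu, s2, 2.0 * ell2_opt) >= kl_eq15(mu, s2)    # any other ell2_k is worse
```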
60 Then, by following a similar procedure to the one above, we can optimally remove the variational factor q({ℓ²k}_{k=1}^K) (see Section B.2 in the supplementary material) [sent-118, score-0.216]
61 to obtain KL = (D/2 + α) Σ_{k=1}^K log( 2β + Σ_{d=1}^D (µ²kd + σ²kd) ) − (1/2) Σ_{k=1}^K Σ_{d=1}^D log(σ²kd) + const, (16) which, as expected, has the nice property that when α = β = 0, so that the prior over the variances becomes improper, it reduces to the quantity in (15). [sent-119, score-0.18]
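In the same spirit, here is a sketch of eq. (16) as reconstructed above, assuming a Gamma(α, β) prior on the precisions 1/ℓ²k (our reading of the "conjugate Gamma prior on the variances" mentioned earlier). It reuses kl_eq15 from the previous sketch, and the assert verifies that at α = β = 0 the two expressions differ only by a constant that does not depend on the variational parameters.

```python
import numpy as np

def kl_eq16(mu, s2, alpha, beta):
    """Eq. (16) with its additive constant omitted: the collapsed KL after the
    variational factor over {ell^2_k} has been removed optimally."""
    D = mu.shape[1]
    row_sums = np.sum(mu**2 + s2, axis=1)
    return ((D / 2.0 + alpha) * np.sum(np.log(2.0 * beta + row_sums))
            - 0.5 * np.sum(np.log(s2)))

rng = np.random.default_rng(4)
K, D = 3, 8
mu = rng.standard_normal((K, D))
s2 = rng.random((K, D)) + 0.1
# At alpha = beta = 0 the improper-prior case recovers eq. (15); the offset below
# is exactly the constant 0.5 * K * D * log(D) that eq. (16) absorbs into "const".
offset = kl_eq16(mu, s2, alpha=0.0, beta=0.0) - kl_eq15(mu, s2)
assert np.isclose(offset, 0.5 * K * D * np.log(D))
```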
62 Finally, it is important to notice that different, and in particular non-Gaussian, priors for the parameters W can also be accommodated by our variational method. [sent-120, score-0.239]
63 More precisely, any alternative prior for W changes only the form of the negative KL divergence term in the variational lower bound. [sent-121, score-0.076]
64 In the experiments we have used the ARD prior described above, while the investigation of alternative priors is left as future work. [sent-124, score-0.075]
65 Furthermore, although the predictive density is not Gaussian, its mean and variance can be computed analytically as explained in Section B. [sent-132, score-0.055]
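The paper computes the predictive mean and variance analytically (Section B of the supplementary material). Purely as an illustration of why these moments remain tractable even though the predictive density is a mixture over W and hence non-Gaussian, here is a crude Monte-Carlo alternative (not the authors' method; it uses full GP predictions per sample of W, reuses mahalanobis_se_kernel from the first sketch, and all names are ours).

```python
import numpy as np

def mc_predictive_moments(X, y, Xstar, q_mu, q_s2, sigma_f, sigma_n,
                          n_samples=200, rng=None):
    """Mean and variance of the integral of p(y* | W, y) q(W) dW, obtained by
    averaging standard GP predictive moments over samples W ~ q(W)."""
    rng = rng or np.random.default_rng(0)
    n = y.shape[0]
    means, variances = [], []
    for _ in range(n_samples):
        W = q_mu + np.sqrt(q_s2) * rng.standard_normal(q_mu.shape)  # W ~ q(W)
        K = mahalanobis_se_kernel(X, X, W, sigma_f) + sigma_n**2 * np.eye(n)
        Ks = mahalanobis_se_kernel(Xstar, X, W, sigma_f)
        kss = np.diag(mahalanobis_se_kernel(Xstar, Xstar, W, sigma_f))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        V = np.linalg.solve(L, Ks.T)
        means.append(Ks @ alpha)
        variances.append(kss - np.sum(V**2, axis=0) + sigma_n**2)
    means, variances = np.array(means), np.array(variances)
    mean = means.mean(axis=0)                          # law of total expectation
    var = variances.mean(axis=0) + means.var(axis=0)   # law of total variance
    return mean, var
```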
66 We will use as benchmarks a full GP with automatic relevance determination (ARD) and the state-of-the-art SPGP-DR model, which is described below. [sent-135, score-0.119]
67 Also, see Section A of the supplementary material for a demonstration of dimensionality reduction on a simple toy example. [sent-136, score-0.135]
68 This model is sometimes referred to as FITC (fully independent training conditional) and uses an active set of m pseudo-inputs that control the speed vs. accuracy trade-off. [sent-139, score-0.063]
69 SPGP is often used when dealing with datasets containing more than a few thousand samples, since in those cases the cost of a full GP becomes impractical. [sent-141, score-0.073]
70 In Snelson and Ghahramani (2006b), a version of SPGP with dimensionality reduction (SPGP-DR) is presented. [sent-142, score-0.086]
71 First, SPGP’s pseudo-inputs are model parameters and, as such, fitting a large number of them can result in overfitting, whereas the inducing inputs used in VDMGP are variational parameters whose optimisation can only result in a better fit of the posterior densities. [sent-148, score-0.411]
72 Second, SPGP-DR does not place a prior on the linear projection matrix W; it is instead fitted using Maximum Likelihood, just as the pseudo-inputs. [sent-149, score-0.106]
73 In contrast, VDMGP does place a prior on W and variationally integrates it out. [sent-150, score-0.106]
74 These differences yield an important consequence: VDMGP can infer automatically the latent dimensionality K of data, but SPGP-DR is unable to, since increasing K is never going to decrease its likelihood. [sent-151, score-0.121]
75 Thus, VDMGP follows Occam’s razor on the number of latent dimensions K. [sent-152, score-0.121]
76 We ran SPGP-DR and VDMGP using exactly the same initialisation for the projection matrix in both algorithms and tested the effect of using a reduced number of training data. [sent-232, score-0.089]
77 For SPGP-DR we tested several possible latent dimensions K = {2, 5, 10, 15, 20, 30}, whereas for VDMGP we fixed K = 20 and let the model infer the number of dimensions. [sent-233, score-0.145]
78 The number of inducing variables (pseudo-inputs for SPGP-DR) was set to 10 for Temp and 20 for SO2 . [sent-234, score-0.073]
79 The multiple dotted blue lines correspond to SPGP-DR with different choices of latent dimensionality K. [sent-238, score-0.121]
80 The dashed black line represents the full GP, which has been run for training sets up to size 2000. [sent-239, score-0.077]
81 When feasible, the full GP performs best, but since it requires the inversion of the full kernel matrix, it cannot be applied to large-scale problems such as the ones considered in this subsection. [sent-242, score-0.223]
82 Also, even in reasonably-sized problems, the full GP may run into trouble if several noise-only input dimensions are present. [sent-243, score-0.151]
83 SPGP-DR works well for large training set sizes, since there is enough information for it to avoid overfitting and the advantage of using a prior on W is reduced. [sent-244, score-0.087]
84 uk/~gcc/competition/. Temp: 106 dimensions, 7117/3558 training/testing data; SO2: 27 dimensions, 15304/7652 training/testing data. [sent-249, score-0.106]
85 Finally, VDMGP results in scalable performance: It is able to perform dimensionality reduction and achieve high accuracy both on small and large datasets, while still being faster than a full GP. [sent-253, score-0.128]
86 Labels represent angular accelerations of one of the robot arm’s links, which have to be predicted based on the angular positions, velocities and torques of the robot arm. [sent-257, score-0.118]
87 It is well-known from previous works (Snelson and Ghahramani, 2006a) that only 4 out of the 32 input dimensions are relevant for the prediction task, and that identifying them is not always easy. [sent-259, score-0.079]
88 In particular, SPGP (the standard version, with no dimensionality reduction), fails at this task unless initialised from a “good guess” about the relevant dimensions coming from a different model, as discussed in (Snelson and Ghahramani, 2006a). [sent-260, score-0.106]
89 We thought it would be interesting to assess the performance of the discussed models on this dataset, again considering different training set sizes, which are generated by randomly sampling from the training set. [sent-261, score-0.091]
90 VDMGPR determines that there are 4 latent dimensions, as shown in Figure 1(c). [sent-263, score-0.068]
91 However, since the computation of the variational bound of VDMGP involves more steps than the computation of the evidence of SPGP-DR, VDMGP is slower than SPGP-DR. [sent-267, score-0.216]
92 In two typical cases, using 500 and 5000 training points, the full GP runs in 0.24 seconds (for 500 training points) and in 34 seconds (for 5000 training points). [sent-268, score-0.077]
93 VDMGP runs in 0. [sent-269, score-0.11]
94 4 Discussion and further work. A typical approach to regression when the number of input dimensions is large is to first use a linear projection of the input data to reduce dimensionality (e.g. [sent-274, score-0.241]
95 Instead of approaching the problem in two separate steps, a monolithic approach allows the dimensionality reduction to be tailored to the specific regression problem. [sent-277, score-0.137]
96 In this work we have shown that it is possible to variationally integrate out the linear projection of the inputs of a GP, which, as a particular case, corresponds to integrating out its length-scale hyperparameters. [sent-278, score-0.117]
97 Only two parameters (noise variance and scale) are free in this model, whereas the remaining parameters appearing in the bound are free variational parameters, and optimizing them can only result in improved posterior estimates. [sent-280, score-0.261]
98 This allows us to automatically infer the number of latent dimensions that are needed for regression in a given problem, which is also not possible using SPGP-DR. [sent-281, score-0.172]
99 Variable noise and dimensionality reduction for sparse Gaussian processes. [sent-343, score-0.086]
100 Variational learning of inducing variables in sparse Gaussian processes. [sent-353, score-0.073]
wordName wordTfidf (topN-words)
[('gp', 0.495), ('vdmgp', 0.41), ('standardised', 0.254), ('spgp', 0.224), ('variational', 0.216), ('wx', 0.198), ('hyperparameters', 0.196), ('titsias', 0.159), ('temp', 0.156), ('kf', 0.146), ('nlpd', 0.137), ('snelson', 0.134), ('kernel', 0.12), ('puma', 0.117), ('nmse', 0.104), ('ard', 0.098), ('mahalanobis', 0.091), ('dk', 0.09), ('optimisable', 0.078), ('vivarelli', 0.078), ('dw', 0.076), ('ghahramani', 0.074), ('kd', 0.073), ('inducing', 0.073), ('kl', 0.07), ('latent', 0.068), ('lawrence', 0.067), ('dr', 0.065), ('williams', 0.06), ('ks', 0.058), ('dimensionality', 0.053), ('dimensions', 0.053), ('prior', 0.052), ('bayesian', 0.051), ('regression', 0.051), ('inputs', 0.051), ('gaussian', 0.049), ('bishop', 0.047), ('relevance', 0.044), ('full', 0.042), ('vdmgpr', 0.039), ('wkd', 0.039), ('auxiliary', 0.037), ('training', 0.035), ('df', 0.035), ('variationally', 0.034), ('saul', 0.034), ('reduction', 0.033), ('determination', 0.033), ('projection', 0.032), ('rk', 0.032), ('variances', 0.031), ('datasets', 0.031), ('density', 0.03), ('rasmussen', 0.03), ('angular', 0.03), ('trouble', 0.03), ('robot', 0.029), ('discusses', 0.029), ('mackay', 0.028), ('referred', 0.028), ('weinberger', 0.027), ('input', 0.026), ('tting', 0.026), ('optimisation', 0.026), ('miguel', 0.026), ('tipping', 0.026), ('squared', 0.026), ('material', 0.025), ('treatment', 0.025), ('analytical', 0.025), ('analytically', 0.025), ('distance', 0.025), ('supplementary', 0.024), ('divergence', 0.024), ('process', 0.024), ('whereas', 0.024), ('freely', 0.023), ('priors', 0.023), ('inside', 0.023), ('panel', 0.023), ('xing', 0.023), ('competition', 0.023), ('conditional', 0.023), ('matrix', 0.022), ('figures', 0.022), ('murray', 0.022), ('inference', 0.022), ('metrics', 0.022), ('representation', 0.022), ('assess', 0.021), ('posterior', 0.021), ('priori', 0.021), ('precisely', 0.02), ('neal', 0.02), ('integrates', 0.02), ('acknowledges', 0.02), ('yi', 0.02), ('seconds', 0.02), ('inversion', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows us to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework recently developed for the Gaussian process latent variable model, and it additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods on datasets with high-dimensional inputs.
2 0.29597008 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
3 0.24120674 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-ofart performance in discrete cases and competitive results in the continuous case. 1
4 0.2233869 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
5 0.19699278 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
6 0.16828489 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
7 0.10848574 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
8 0.097847797 143 nips-2013-Integrated Non-Factorized Variational Inference
9 0.088164695 137 nips-2013-High-Dimensional Gaussian Process Bandits
10 0.083531946 75 nips-2013-Convex Two-Layer Modeling
11 0.081837729 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
12 0.077428751 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
13 0.077285334 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
14 0.07549569 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
15 0.066700481 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
16 0.061230268 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
17 0.05963219 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
18 0.058084473 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
19 0.05689846 224 nips-2013-On the Sample Complexity of Subspace Learning
20 0.056305856 317 nips-2013-Streaming Variational Bayes
topicId topicWeight
[(0, 0.171), (1, 0.057), (2, -0.027), (3, -0.009), (4, -0.027), (5, 0.092), (6, 0.132), (7, 0.08), (8, 0.078), (9, -0.088), (10, -0.134), (11, -0.039), (12, -0.21), (13, 0.024), (14, 0.013), (15, 0.017), (16, -0.129), (17, 0.088), (18, -0.038), (19, -0.286), (20, 0.109), (21, -0.12), (22, -0.156), (23, 0.118), (24, 0.001), (25, -0.019), (26, -0.044), (27, -0.028), (28, 0.019), (29, 0.133), (30, -0.232), (31, 0.043), (32, 0.003), (33, 0.05), (34, 0.13), (35, 0.104), (36, 0.059), (37, -0.06), (38, -0.036), (39, -0.062), (40, 0.015), (41, -0.094), (42, 0.019), (43, -0.016), (44, 0.004), (45, -0.06), (46, 0.032), (47, 0.072), (48, -0.022), (49, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.93183279 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows us to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework recently developed for the Gaussian process latent variable model, and it additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods on datasets with high-dimensional inputs.
2 0.83838964 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-ofart performance in discrete cases and competitive results in the continuous case. 1
3 0.80180329 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
4 0.72262871 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
5 0.56482184 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
Author: Andreas Ruttor, Philipp Batz, Manfred Opper
Abstract: We introduce a nonparametric approach for estimating drift functions in systems of stochastic differential equations from sparse observations of the state vector. Using a Gaussian process prior over the drift as a function of the state vector, we develop an approximate EM algorithm to deal with the unobserved, latent dynamics between observations. The posterior over states is approximated by a piecewise linearized process of the Ornstein-Uhlenbeck type and the MAP estimation of the drift is facilitated by a sparse Gaussian process regression. 1
6 0.55674607 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
7 0.47179967 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
8 0.40531942 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
9 0.37922639 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
10 0.3649421 86 nips-2013-Demixing odors - fast inference in olfaction
11 0.35291141 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex
12 0.35020515 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
13 0.34465608 137 nips-2013-High-Dimensional Gaussian Process Bandits
14 0.33936396 143 nips-2013-Integrated Non-Factorized Variational Inference
15 0.33817708 317 nips-2013-Streaming Variational Bayes
16 0.33772168 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
18 0.3198469 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations
19 0.31474483 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
20 0.30648217 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
topicId topicWeight
[(2, 0.012), (16, 0.033), (33, 0.115), (34, 0.227), (41, 0.021), (49, 0.039), (56, 0.093), (70, 0.015), (74, 0.243), (85, 0.04), (89, 0.03), (93, 0.042), (95, 0.015)]
simIndex simValue paperId paperTitle
Author: Shinichi Nakajima, Akiko Takeda, S. Derin Babacan, Masashi Sugiyama, Ichiro Takeuchi
Abstract: When a probabilistic model and its prior are given, Bayesian learning offers inference with automatic parameter tuning. However, Bayesian learning is often obstructed by computational difficulty: the rigorous Bayesian learning is intractable in many models, and its variational Bayesian (VB) approximation is prone to suffer from local minima. In this paper, we overcome this difficulty for low-rank subspace clustering (LRSC) by providing an exact global solver and its efficient approximation. LRSC extracts a low-dimensional structure of data by embedding samples into the union of low-dimensional subspaces, and its variational Bayesian variant has shown good performance. We first prove a key property that the VBLRSC model is highly redundant. Thanks to this property, the optimization problem of VB-LRSC can be separated into small subproblems, each of which has only a small number of unknown variables. Our exact global solver relies on another key property that the stationary condition of each subproblem consists of a set of polynomial equations, which is solvable with the homotopy method. For further computational efficiency, we also propose an efficient approximate variant, of which the stationary condition can be written as a polynomial equation with a single variable. Experimental results show the usefulness of our approach. 1
same-paper 2 0.82234716 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows us to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework recently developed for the Gaussian process latent variable model, and it additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods on datasets with high-dimensional inputs.
3 0.79773676 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
Author: Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko
Abstract: We consider the problem of embedding entities and relationships of multirelational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples. 1
4 0.79561186 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
5 0.78291529 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel
Author: Ping Li, Gennady Samorodnitsk, John Hopcroft
Abstract: The method of stable random projections is useful for efficiently approximating the lα distance (0 < α ≤ 2) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when α = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ2 ) similarity. In text and vision applications, the χ2 similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009. There are many future research problems. For example, when α → 0, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams. 1
6 0.75764722 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
7 0.74829203 210 nips-2013-Noise-Enhanced Associative Memories
8 0.7465201 122 nips-2013-First-order Decomposition Trees
9 0.74603677 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
10 0.74595469 202 nips-2013-Multiclass Total Variation Clustering
11 0.7450732 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
12 0.74437505 143 nips-2013-Integrated Non-Factorized Variational Inference
13 0.73932922 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
14 0.73905301 129 nips-2013-Generalized Random Utility Models with Multiple Types
15 0.73754126 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
16 0.73539209 256 nips-2013-Probabilistic Principal Geodesic Analysis
17 0.71514648 347 nips-2013-Variational Planning for Graph-based MDPs
18 0.71415305 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
19 0.70369524 86 nips-2013-Demixing odors - fast inference in olfaction
20 0.70262176 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning