nips nips2011 nips2011-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a Gaussian process model of functions which are additive. [sent-7, score-0.15]
2 An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. [sent-8, score-0.538]
3 Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. [sent-9, score-0.091]
4 We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. [sent-11, score-0.818]
5 The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. [sent-12, score-0.099]
6 1 Introduction Most statistical regression models in use today are of the form: g(y) = f(x1) + f(x2) + · · · + f(xD). [sent-13, score-0.161]
7 This family of functions, known as Generalized Additive Models (GAM) [2], is typically easy to fit and interpret. [sent-15, score-0.039]
8 Some extensions of this family, such as smoothing-splines ANOVA [3], add terms depending on more than one variable. [sent-16, score-0.039]
9 However, such models generally become intractable and difficult to fit as the number of terms increases. [sent-17, score-0.041]
10 At the other end of the spectrum are kernel-based models, which typically allow the response to depend on all input variables simultaneously. [sent-18, score-0.121]
11 A popular example would be a Gaussian process model using a squared-exponential (or Gaussian) kernel. [sent-23, score-0.09]
12 This model is much more flexible than the GAM, but its flexibility makes it difficult to generalize to new combinations of input variables. [sent-25, score-0.145]
13 In this paper, we introduce a Gaussian process model that generalizes both GAMs and the SE-GP. [sent-26, score-0.128]
14 This is achieved through a kernel which allows additive interactions of all orders, ranging from first-order interactions (as in a GAM) all the way to Dth-order interactions (as in an SE-GP). [sent-27, score-1.047]
15 Although this kernel amounts to a sum over an exponential number of terms, we show how to compute this kernel efficiently, and introduce a parameterization which limits the number of hyperparameters to O(D). [sent-28, score-1.021]
16 A Gaussian process with this kernel function (an additive GP) constitutes a powerful model that allows one to automatically determine which orders of interaction are important. [sent-29, score-0.962]
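To make the efficiency claim concrete, here is a minimal sketch (assuming squared-exponential base kernels and a per-order variance weighting; the function names and exact parameterization are illustrative, not taken from the paper) of how a sum over all interaction orders can be evaluated without enumerating the exponentially many terms: the nth-order term is the nth elementary symmetric polynomial of the one-dimensional base kernels, which the Newton-Girard recursion computes in O(D^2) time.

```python
import numpy as np

def se_1d(x, xp, lengthscale):
    """One-dimensional squared-exponential base kernel k_d(x_d, x_d')."""
    return np.exp(-0.5 * (x - xp) ** 2 / lengthscale ** 2)

def additive_kernel(x, xp, lengthscales, order_variances):
    """Sum interaction terms of every order 1..D at a single pair of points.

    Uses the Newton-Girard recursion for elementary symmetric polynomials,
    so the exponentially many interaction terms are never enumerated.
    """
    D = len(x)
    z = np.array([se_1d(x[d], xp[d], lengthscales[d]) for d in range(D)])
    # Power sums p_k = sum_d z_d^k for k = 1..D.
    p = np.array([np.sum(z ** k) for k in range(1, D + 1)])
    # Elementary symmetric polynomials e_0..e_D via
    # e_n = (1/n) * sum_{k=1}^{n} (-1)^(k-1) * e_{n-k} * p_k.
    e = np.zeros(D + 1)
    e[0] = 1.0
    for n in range(1, D + 1):
        e[n] = sum((-1) ** (k - 1) * e[n - k] * p[k - 1]
                   for k in range(1, n + 1)) / n
    # Each interaction order n gets its own variance hyperparameter.
    return sum(order_variances[n - 1] * e[n] for n in range(1, D + 1))
```

With D lengthscales and D order variances, this parameterization uses O(D) hyperparameters, matching the count quoted above.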
17 We note that a similar breakthrough has recently been made, called Hierarchical Kernel Learning (HKL) [4]. [sent-32, score-0.071]
18 HKL explores a similar class of models, and sidesteps the possibly exponential number of interaction terms by cleverly selecting only a tractable subset. [sent-33, score-0.317]
19 However, this method suffers considerably from the fact that cross-validation must be used to set hyperparameters. [sent-34, score-0.072]
20 In addition, the machinery necessary to train these models is immense. [sent-35, score-0.094]
21 Finally, on real datasets, HKL is outperformed by the standard SE-GP [4]. [sent-36, score-0.034]
22–24 [Figure 1 panel labels: k1(x1) (1D kernel), k2(x2) (1D kernel), k1(x1) + k2(x2) (1st-order kernel), k1(x1)k2(x2) (2nd-order kernel)] [sent-52–55]
25 Left: a draw from a first-order additive kernel corresponds to a sum of draws from one-dimensional kernels. [sent-60, score-0.954]
26 Right: functions drawn from a product kernel prior have weaker long-range dependencies, and less long-range structure. [sent-61, score-0.538]
27 2 Gaussian Process Models Gaussian processes are a flexible and tractable prior over functions, useful for solving regression and classification tasks [5]. [sent-62, score-0.256]
28 The kind of structure which can be captured by a GP model is mainly determined by its kernel: the covariance function. [sent-63, score-0.093]
29 One of the main difficulties in specifying a Gaussian process model is in choosing a kernel which can represent the structure present in the data. [sent-64, score-0.429]
30 For small to medium-sized datasets, the kernel has a large impact on modeling efficacy. [sent-65, score-0.367]
31 Figure 1 compares, for two-dimensional functions, a first-order additive kernel with a second-order kernel. [sent-66, score-0.688]
32 We can see that a GP with a first-order additive kernel is an example of a GAM: Each function drawn from this model is a sum of orthogonal one-dimensional functions. [sent-67, score-0.786]
33 Compared to functions drawn from the higher-order GP, draws from the first-order GP have more long-range structure. [sent-68, score-0.177]
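To illustrate the contrast drawn in Figure 1 and above, the sketch below (grid size, lengthscales, and variable names are our own illustrative choices) builds the first-order additive and the second-order product Gram matrices from the same two one-dimensional squared-exponential kernels and draws one prior sample from each; the additive draw decomposes as f1(x1) + f2(x2), exactly the GAM structure.

```python
import numpy as np

def se_1d_gram(x, lengthscale=1.0):
    """Gram matrix of a 1-D squared-exponential kernel on a vector of inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d ** 2 / lengthscale ** 2)

# Two-dimensional inputs on a grid, as in the example of Figure 1.
g = np.linspace(-4, 4, 30)
X1, X2 = np.meshgrid(g, g)
x1, x2 = X1.ravel(), X2.ravel()

K1 = se_1d_gram(x1)      # k1(x1, x1')
K2 = se_1d_gram(x2)      # k2(x2, x2')
K_additive = K1 + K2     # first-order additive kernel (GAM-like prior)
K_product = K1 * K2      # second-order / product (SE) kernel

# One draw from each GP prior: the additive draw is a sum of a function of
# x1 and a function of x2; the product-kernel draw need not decompose.
rng = np.random.default_rng(0)
jitter = 1e-8 * np.eye(len(x1))
f_additive = rng.multivariate_normal(np.zeros(len(x1)), K_additive + jitter)
f_product = rng.multivariate_normal(np.zeros(len(x1)), K_product + jitter)
```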
34 We can expect many natural functions to depend only on sums of low-order interactions. [sent-69, score-0.128]
35 For example, the price of a house or car will presumably be well approximated by a sum of prices of individual features, such as a sun-roof. [sent-70, score-0.35]
36 Other parts of the price may depend jointly on a small set of features, such as the size and building materials of a house. [sent-71, score-0.174]
37 Capturing these regularities will mean that a model can confidently extrapolate to unseen combinations of features. [sent-72, score-0.225]
38 3 Additive Kernels We now give a precise definition of additive kernels. [sent-73, score-0.351]
wordName wordTfidf (topN-words)
[('gp', 0.436), ('hkl', 0.352), ('additive', 0.351), ('kernel', 0.337), ('gam', 0.285), ('xid', 0.176), ('ki', 0.158), ('draw', 0.129), ('xd', 0.105), ('parameterization', 0.102), ('interactions', 0.098), ('interaction', 0.086), ('orders', 0.084), ('draws', 0.081), ('price', 0.081), ('gaussian', 0.081), ('tractable', 0.079), ('dently', 0.077), ('kid', 0.077), ('breakthrough', 0.071), ('extrapolate', 0.071), ('regression', 0.07), ('prices', 0.067), ('anova', 0.067), ('mpi', 0.067), ('prior', 0.066), ('edward', 0.063), ('hannes', 0.063), ('cleverly', 0.061), ('xi', 0.059), ('regularities', 0.058), ('bingen', 0.058), ('exible', 0.056), ('combinations', 0.056), ('kj', 0.056), ('sum', 0.056), ('process', 0.055), ('cacy', 0.054), ('house', 0.054), ('materials', 0.054), ('generalized', 0.054), ('functions', 0.054), ('nickisch', 0.053), ('decomposes', 0.053), ('carl', 0.053), ('presumably', 0.053), ('gps', 0.053), ('machinery', 0.053), ('nth', 0.051), ('hn', 0.051), ('exponential', 0.05), ('generalize', 0.05), ('today', 0.05), ('constitutes', 0.049), ('interpretability', 0.046), ('expressive', 0.045), ('rasmussen', 0.044), ('culties', 0.044), ('spectrum', 0.043), ('drawn', 0.042), ('introduce', 0.041), ('processes', 0.041), ('models', 0.041), ('engineering', 0.041), ('xj', 0.041), ('hyperparameter', 0.041), ('germany', 0.041), ('explores', 0.041), ('unseen', 0.04), ('exibility', 0.04), ('hierarchical', 0.04), ('car', 0.039), ('depending', 0.039), ('family', 0.039), ('depend', 0.039), ('input', 0.039), ('weaker', 0.039), ('suffers', 0.039), ('specifying', 0.037), ('ranging', 0.036), ('sums', 0.035), ('limits', 0.035), ('popular', 0.035), ('intelligent', 0.035), ('outperformed', 0.034), ('hyperparameters', 0.033), ('kind', 0.033), ('considerably', 0.033), ('datasets', 0.032), ('generalizes', 0.032), ('captured', 0.031), ('capturing', 0.031), ('amounts', 0.03), ('compares', 0.03), ('cambridge', 0.03), ('impact', 0.03), ('order', 0.029), ('mainly', 0.029), ('assign', 0.029), ('increased', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
2 0.26524282 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
3 0.18009998 30 nips-2011-Algorithms for Hyper-Parameter Optimization
Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl
Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1
4 0.17848459 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
5 0.12359213 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider regularized risk minimization in a large dictionary of Reproducing kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using l1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike l1 -MKL, our approach decouples the sparsity regularizer (via a direct l0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16]. 1
6 0.11951327 171 nips-2011-Metric Learning with Multiple Kernels
7 0.1160742 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
8 0.10285322 101 nips-2011-Gaussian process modulated renewal processes
9 0.092200756 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
10 0.086983666 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.086422987 301 nips-2011-Variational Gaussian Process Dynamical Systems
12 0.084936008 24 nips-2011-Active learning of neural response functions with Gaussian processes
13 0.080666602 139 nips-2011-Kernel Bayes' Rule
14 0.080046259 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
15 0.067700312 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
16 0.06609869 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
17 0.065772101 154 nips-2011-Learning person-object interactions for action recognition in still images
18 0.064439148 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
19 0.061731704 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
20 0.059264451 54 nips-2011-Co-regularized Multi-view Spectral Clustering
topicId topicWeight
[(0, 0.147), (1, 0.019), (2, 0.034), (3, -0.039), (4, -0.05), (5, -0.065), (6, 0.073), (7, -0.033), (8, 0.122), (9, 0.135), (10, -0.247), (11, 0.028), (12, 0.065), (13, 0.108), (14, -0.107), (15, 0.234), (16, 0.085), (17, 0.198), (18, -0.205), (19, -0.139), (20, -0.018), (21, 0.081), (22, -0.126), (23, -0.113), (24, 0.052), (25, -0.001), (26, 0.018), (27, -0.048), (28, -0.005), (29, -0.137), (30, -0.057), (31, -0.012), (32, 0.17), (33, 0.114), (34, 0.077), (35, -0.003), (36, 0.002), (37, -0.044), (38, -0.019), (39, 0.043), (40, 0.052), (41, -0.019), (42, -0.002), (43, -0.05), (44, -0.018), (45, 0.006), (46, -0.023), (47, 0.02), (48, -0.015), (49, 0.118)]
simIndex simValue paperId paperTitle
same-paper 1 0.97041148 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
2 0.79045635 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
3 0.7550236 30 nips-2011-Algorithms for Hyper-Parameter Optimization
Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl
Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1
4 0.61453807 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
5 0.51609832 139 nips-2011-Kernel Bayes' Rule
Author: Kenji Fukumizu, Le Song, Arthur Gretton
Abstract: A nonparametric kernel-based method for realizing Bayes’ rule is proposed, based on kernel representations of probabilities in reproducing kernel Hilbert spaces. The prior and conditional probabilities are expressed as empirical kernel mean and covariance operators, respectively, and the kernel mean of the posterior distribution is computed in the form of a weighted sample. The kernel Bayes’ rule can be applied to a wide variety of Bayesian inference problems: we demonstrate Bayesian computation without likelihood, and filtering with a nonparametric state-space model. A consistency rate for the posterior estimate is established. 1
6 0.5062232 101 nips-2011-Gaussian process modulated renewal processes
7 0.46513078 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
8 0.42346701 24 nips-2011-Active learning of neural response functions with Gaussian processes
9 0.42107227 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
10 0.38722548 171 nips-2011-Metric Learning with Multiple Kernels
11 0.37372056 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
12 0.36018902 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
13 0.34828597 301 nips-2011-Variational Gaussian Process Dynamical Systems
14 0.30663359 194 nips-2011-On Causal Discovery with Cyclic Additive Noise Models
15 0.2948252 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
16 0.28672957 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
17 0.28300744 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
18 0.27975842 225 nips-2011-Probabilistic amplitude and frequency demodulation
19 0.27914709 286 nips-2011-The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning
20 0.27776167 240 nips-2011-Robust Multi-Class Gaussian Process Classification
topicId topicWeight
[(0, 0.011), (20, 0.013), (31, 0.04), (43, 0.728), (45, 0.033), (57, 0.036), (74, 0.011), (99, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.98982781 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
2 0.90181506 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
Author: Shilin Ding, Grace Wahba, Xiaojin Zhu
Abstract: In discrete undirected graphical models, the conditional independence of node labels Y is specified by the graph structure. We study the case where there is another input random vector X (e.g. observed features) such that the distribution P (Y | X) is determined by functions of X that characterize the (higher-order) interactions among the Y ’s. The main contribution of this paper is to learn the graph structure and the functions conditioned on X at the same time. We prove that discrete undirected graphical models with feature X are equivalent to multivariate discrete models. The reparameterization of the potential functions in graphical models by conditional log odds ratios of the latter offers advantages in representation of the conditional independence structure. The functional spaces can be flexibly determined by kernels. Additionally, we impose a Structure Lasso (SLasso) penalty on groups of functions to learn the graph structure. These groups with overlaps are designed to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. 1
3 0.89334536 282 nips-2011-The Fast Convergence of Boosting
Author: Matus J. Telgarsky
Abstract: This manuscript considers the convergence rate of boosting under a large class of losses, including the exponential and logistic losses, where the best previous rate of convergence was O(exp(1/✏2 )). First, it is established that the setting of weak learnability aids the entire class, granting a rate O(ln(1/✏)). Next, the (disjoint) conditions under which the infimal empirical risk is attainable are characterized in terms of the sample and weak learning class, and a new proof is given for the known rate O(ln(1/✏)). Finally, it is established that any instance can be decomposed into two smaller instances resembling the two preceding special cases, yielding a rate O(1/✏), with a matching lower bound for the logistic loss. The principal technical hurdle throughout this work is the potential unattainability of the infimal empirical risk; the technique for overcoming this barrier may be of general interest. 1
4 0.87777776 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis
Author: Jonathan W. Pillow, Il M. Park
Abstract: Neurons typically respond to a restricted number of stimulus features within the high-dimensional space of natural stimuli. Here we describe an explicit model-based interpretation of traditional estimators for a neuron’s multi-dimensional feature space, which allows for several important generalizations and extensions. First, we show that traditional estimators based on the spike-triggered average (STA) and spike-triggered covariance (STC) can be formalized in terms of the “expected log-likelihood” of a Linear-Nonlinear-Poisson (LNP) model with Gaussian stimuli. This model-based formulation allows us to define maximum-likelihood and Bayesian estimators that are statistically consistent and efficient in a wider variety of settings, such as with naturalistic (non-Gaussian) stimuli. It also allows us to employ Bayesian methods for regularization, smoothing, sparsification, and model comparison, and provides Bayesian confidence intervals on model parameters. We describe an empirical Bayes method for selecting the number of features, and extend the model to accommodate an arbitrary elliptical nonlinear response function, which results in a more powerful and more flexible model for feature space inference. We validate these methods using neural data recorded extracellularly from macaque primary visual cortex. 1
5 0.86763525 264 nips-2011-Sparse Recovery with Brownian Sensing
Author: Alexandra Carpentier, Odalric-ambrym Maillard, Rémi Munos
Abstract: We consider the problem of recovering the parameter α ∈ R^K of a sparse function f (i.e. the number of non-zero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of well-chosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix, for which good recovery properties are proven, independently of the number of sampling points N, even when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate α̂ of the parameter such that ‖α̂ − α‖_2 = O(‖η‖_2 / √N), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method. 1
6 0.8550511 288 nips-2011-Thinning Measurement Models and Questionnaire Design
7 0.65274179 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
8 0.62260371 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
9 0.61982101 123 nips-2011-How biased are maximum entropy models?
10 0.61021036 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
11 0.59244978 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction
12 0.58892131 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
13 0.57081926 24 nips-2011-Active learning of neural response functions with Gaussian processes
14 0.56485909 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
15 0.56457943 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
16 0.56370449 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
17 0.56035829 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
18 0.5539394 152 nips-2011-Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint
19 0.55349743 306 nips-2011-t-divergence Based Approximate Inference
20 0.54725742 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis