nips nips2011 nips2011-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. [sent-5, score-0.526]
2 We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. [sent-6, score-0.349]
3 This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. [sent-11, score-1.082]
4 The input noise variances are inferred from the data as extra hyperparameters. [sent-12, score-0.536]
5 They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. [sent-13, score-0.304]
6 Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. [sent-14, score-0.26]
7 Standard GP regression [1] makes two assumptions about the noise in datasets: firstly that measurements of input points, x, are noise-free, and, secondly, that output points, y, are corrupted by constant-variance Gaussian noise. [sent-20, score-0.703]
8 In this paper we look at datasets where the input measurements, as well as the output, are corrupted by noise. [sent-24, score-0.255]
9 If, as an approximation, we treat the input measurements as if they were deterministic, and inflate the corresponding output variance to compensate, this leads to the output noise variance varying across the input space, a feature often called heteroscedasticity. [sent-26, score-1.074]
10 One method for modelling datasets with input noise is, therefore, to hold the input measurements to be deterministic and then use a heteroscedastic GP model. [sent-27, score-1.011]
11 However, referring the input noise to the output in this way results in heteroscedasticity with a very particular structure. [sent-29, score-0.572]
12 This structure can be exploited to improve upon current heteroscedastic GP models for datasets with input noise. [sent-30, score-0.402]
13 One can imagine that in regions where a process is changing its output value rapidly, corrupted input measurements will have a much greater effect than in regions where the output is almost constant. [sent-31, score-0.515]
14 In other words, the effect of the input noise is related to the gradient of the function mapping input to output. [sent-32, score-0.659]
15 We fit a local linear model to the GP posterior mean about each training point. [sent-34, score-0.282]
16 The input noise variance can then be referred to the output, proportional to the square of the posterior mean function’s gradient. [sent-35, score-0.823]
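To make this referred-noise idea concrete, here is a minimal numpy sketch of the corrective term: the effective output noise at a point is the output noise variance plus the posterior-mean gradient passed through the input noise covariance. The function name and the example numbers are illustrative, not taken from the paper.

```python
import numpy as np

def corrected_output_variance(sigma_y2, grad_mean, Sigma_x):
    """Refer input noise to the output via a local linear expansion.

    sigma_y2  : scalar output noise variance
    grad_mean : (D,) gradient of the GP posterior mean at one input point
    Sigma_x   : (D, D) covariance of the input noise
    Returns the effective output noise variance at that point.
    """
    grad_mean = np.asarray(grad_mean, dtype=float)
    return sigma_y2 + grad_mean @ Sigma_x @ grad_mean

# A steep slope inflates the effective output noise far more than a shallow
# one, for the same input noise level.
Sigma_x = np.diag([0.1**2])
print(corrected_output_variance(0.05**2, [2.0], Sigma_x))   # steep region
print(corrected_output_variance(0.05**2, [0.1], Sigma_x))   # nearly flat region
```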
17 This approach is particularly powerful in the case of time-series data where the output at time t becomes the input at time t + 1. [sent-36, score-0.237]
18 In this situation, input measurements are clearly not noise-free: the noise on a particular measurement is the same whether it is considered an input or output. [sent-37, score-0.706]
19 Furthermore, we can estimate the noise variance on each input dimension, which is often very useful for analysis. [sent-39, score-0.567]
20 One approach, proposed in [2], is to make the noise variance a random variable and attempt to estimate its form at the same time as estimating the posterior mean. [sent-42, score-0.56]
21 Goldberg et al. suggested using a second GP to model the noise level as a function of the input location. [sent-44, score-0.478]
22 Snelson and Ghahramani [6] suggest a different approach whereby the importance of points in a pseudo-training set can be varied, allowing the posterior variance to vary as well. [sent-49, score-0.303]
23 Although all of these methods could be applied to datasets with input noise, they are designed for a more general class of heteroscedastic problems and so none of them exploits the structure inherent in input noise datasets. [sent-51, score-0.852]
24 However, their method requires prior knowledge of the noise variance, rather than inferring it as we do in this paper. [sent-55, score-0.293]
25 Let x and y be a pair of measurements from a process, where x is a D dimensional input to the process and y is the corresponding scalar output. [sent-57, score-0.269]
26 In standard GP regression we assume that y is a noisy measurement of the actual output of the process, ỹ: y = ỹ + εy (1), where εy ∼ N(0, σy²). [sent-58, score-0.255]
27 In our model, we further assume that the inputs are also noisy measurements of the actual input, x̃: x = x̃ + εx (2), where εx ∼ N(0, Σx). [sent-59, score-0.303]
28 Assuming the actual output is a function of the actual input (ỹ = f(x̃)), we can write the measured output in terms of the measured input as y = f(x − εx) + εy (3). For a GP model the posterior distribution based on equation 3 is intractable. [sent-62, score-0.441]
29 Note that if the posterior mean gradient is constant across the input space the heteroscedasticity is removed and our model is essentially identical to a standard GP. [sent-83, score-0.475]
30 As the input noise levels are the same for each of the output dimensions, our model can use data from all of the outputs when learning the input noise variances. [sent-85, score-1.099]
31 Not only does this give more information about the noise variances without needing further input measurements but it also reduces over-fitting as the learnt noise variances must agree with all E output dimensions. [sent-86, score-1.06]
32 For time-series datasets (where the model has to predict the next state given the current), each dimension’s input and output noise variance can be constrained to be the same since the noise level on a measurement is independent of whether it is an input or output. [sent-87, score-1.189]
33 This further constraint increases the ability of the model to recover the actual noise variances. [sent-88, score-0.321]
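A small sketch of that combined-noise parameterisation for time-series data: one noise variance hyperparameter per state dimension is shared between the input side and the output side. The variable names and values here are assumptions for illustration.

```python
import numpy as np

# One shared log-variance hyperparameter per state dimension: it acts as the
# input noise variance for that dimension and as the output noise variance
# for the corresponding prediction target (time-series / combined mode).
D = 2                                        # state dimension
log_var = np.log(np.array([0.05, 0.02])**2)  # shared noise hyperparameters
Sigma_x = np.diag(np.exp(log_var))           # input noise covariance
sigma_y2 = np.exp(log_var)                   # output noise variance, one per output
print(Sigma_x)
print(sigma_y2)
```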
34 3 Training Our model introduces an extra D hyperparameters compared to the standard GP - one noise variance hyperparameter per input dimension. [sent-90, score-0.815]
35 A major advantage of our model is that these hyperparameters can be trained alongside any others by maximisation of the marginal likelihood. [sent-91, score-0.332]
36 This approach automatically includes regularisation of the noise parameters and reduces the effect of over-fitting. [sent-92, score-0.319]
37 In order to calculate the marginal likelihood of the training data we need the posterior distribution, and the slope of its mean, at each of the training points. [sent-93, score-0.441]
38 Therefore, we define a two-step approach: first we evaluate a standard GP with the training data, using our initial hyperparameter settings and ignoring the input noise. [sent-95, score-0.3]
39 We then find the slope of the posterior mean of this GP at each of the training points and use it to add in the corrective variance term, diag{∆f̄ Σx ∆f̄ᵀ}. [sent-96, score-0.542]
40 The marginal likelihood of the GP with the corrected variance is then computed, along with its derivatives with respect to the initial hyperparameters, which include the input noise variances. [sent-98, score-0.676]
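The two-step scheme above, and the iteration of it described further below, can be sketched in a few dozen lines. The sketch uses a one-dimensional squared exponential GP, analytic posterior-mean slopes, and scipy's L-BFGS-B optimiser; the toy data, function names, initial values and number of outer iterations are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D problem: outputs generated from the true inputs, inputs measured with noise.
rng = np.random.default_rng(0)
x_true = np.linspace(-3, 3, 60)
X = (x_true + 0.2 * rng.standard_normal(60))[:, None]    # noisy input measurements
y = np.sin(2 * x_true) + 0.05 * rng.standard_normal(60)  # noisy output measurements

def se_kernel(A, B, log_ell, log_sf2):
    d = A - B.T                                           # pairwise differences (1-D inputs)
    return np.exp(log_sf2) * np.exp(-0.5 * d**2 / np.exp(2 * log_ell))

def neg_log_marglik(theta, X, y, slope2):
    """Negative log marginal likelihood with the slope-corrected noise."""
    log_ell, log_sf2, log_sy2, log_sx2 = theta
    n = len(y)
    K = se_kernel(X, X, log_ell, log_sf2)
    noise = np.exp(log_sy2) + np.exp(log_sx2) * slope2    # output noise + slope^2 * input noise
    L = np.linalg.cholesky(K + np.diag(noise) + 1e-8 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

def posterior_mean_slope(theta, X, y, slope2):
    """Analytic gradient of the posterior mean at the training inputs (SE kernel)."""
    log_ell, log_sf2, log_sy2, log_sx2 = theta
    n = len(y)
    K = se_kernel(X, X, log_ell, log_sf2)
    noise = np.exp(log_sy2) + np.exp(log_sx2) * slope2
    alpha = np.linalg.solve(K + np.diag(noise) + 1e-8 * np.eye(n), y)
    dK = -(X - X.T) / np.exp(2 * log_ell) * K             # d k(x_i, x_j) / d x_i
    return dK @ alpha

theta = np.array([0.0, 0.0, np.log(0.1), np.log(0.1)])    # initial hyperparameters (log scale)
slope2 = np.zeros(len(y))                                 # first pass: ignore input noise
for _ in range(3):                                        # alternate: optimise, then re-evaluate slopes
    res = minimize(neg_log_marglik, theta, args=(X, y, slope2),
                   method="L-BFGS-B", bounds=[(-5.0, 5.0)] * 4)
    theta = res.x
    slope2 = posterior_mean_slope(theta, X, y, slope2) ** 2
print("learnt output / input noise std:",
      np.sqrt(np.exp(theta[2])), np.sqrt(np.exp(theta[3])))
```

In this sketch the marginal-likelihood derivatives are approximated by finite differences inside L-BFGS-B to keep the code short; the paper computes them analytically, including the derivatives with respect to the input noise variances.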
41 Figure 1c shows the GP posterior for the trained hyperparameters and shows how NIGP can reduce output noise level estimates by taking input noise into account. [sent-101, score-1.15]
42 (a) A standard GP posterior distribution can be computed from an initial set of hyperparameters and a training data set, shown by the blue crosses. [sent-104, score-0.364]
43 The gradients of the posterior mean at each training point can then be found analytically. [sent-105, score-0.254]
44 (b) The NIGP method increases the posterior variance by the square of the posterior mean slope multiplied by the current setting of the input noise variance hyperparameter. [sent-106, score-1.123]
45 (c) This plot shows the standard GP posterior using the newly trained hyperparameters. [sent-112, score-0.291]
46 Comparing to plot (a) shows that the output noise hyperparameter has been greatly reduced. [sent-113, score-0.45]
47 (d) This plot shows the NIGP fit: plot (c) with the input noise corrective variance term, diag{∆f̄ Σx ∆f̄ᵀ}. [sent-114, score-0.669]
48 To improve the fit further we can iterate this procedure: we use the slopes of the current trained NIGP, instead of a standard GP, to calculate the effect of the input noise. [sent-116, score-0.286]
49 4 Prediction We turn now to the task of making predictions at noisy input locations with our model. [sent-119, score-0.229]
50 We therefore use the trained hyperparameters and the training data to define a GP posterior mean, which we differentiate at each test point and each training point. [sent-121, score-0.495]
51 The posterior mean slope at the test points is only used to calculate the variance over observations, where we increase the predictive variance by the noise variances. [sent-123, score-0.914]
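A minimal sketch of that observation-variance calculation at a test point, assuming the GP posterior variance of the latent function and the posterior mean slope there have already been computed; the function name and numbers are illustrative.

```python
import numpy as np

def nigp_observation_variance(gp_var, slope, sigma_y2, Sigma_x):
    """Predictive variance over observations at a test input.

    gp_var   : GP posterior variance of the latent function at the test point
    slope    : (D,) posterior mean gradient at the test point
    sigma_y2 : learnt output noise variance
    Sigma_x  : (D, D) learnt input noise covariance
    """
    slope = np.asarray(slope, dtype=float)
    return gp_var + sigma_y2 + slope @ Sigma_x @ slope

# The same latent uncertainty yields a much wider observation interval where
# the posterior mean is steep.
print(nigp_observation_variance(0.02, [1.5, -0.3], 0.01, np.diag([0.04, 0.04])))
print(nigp_observation_variance(0.02, [0.1,  0.0], 0.01, np.diag([0.04, 0.04])))
```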
52 If a single test point is considered to have a Gaussian distribution and all the training points are certain then, although the GP posterior is unknown, its mean and variance can be calculated exactly [11]. [sent-125, score-0.467]
53 As our model estimates the input noise variance Σx during training, we can consider the latent test input to be Gaussian distributed about the measured location: x̃∗ ∼ N(x∗, Σx). [sent-126, score-0.627]
54 This method is computationally slower than using equation 7 and is vulnerable to worse results if the learnt input noise variance Σx is very different from the true value. [sent-129, score-0.655]
55 We also compare against Kersting et al.’s ‘most likely heteroscedastic GP’ (MLHGP) model, a state-of-the-art heteroscedastic GP model. [sent-133, score-0.414]
56 We used the squared exponential kernel with Automatic Relevance Determination, k(xi, xj) = σf² exp(−½ (xi − xj)ᵀ Λ⁻¹ (xi − xj)) (12), where Λ is a diagonal matrix of the squared lengthscale hyperparameters and σf² is a signal variance hyperparameter. [sent-134, score-0.431]
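A self-contained numpy version of that SE-ARD kernel (equation 12); the vectorised distance computation and the example inputs are implementation choices, not taken from the paper.

```python
import numpy as np

def se_ard_kernel(A, B, lengthscales, sf2):
    """Squared exponential kernel with Automatic Relevance Determination.

    A: (n, D) inputs, B: (m, D) inputs,
    lengthscales: (D,) per-dimension lengthscales (Lambda = diag(lengthscales**2)),
    sf2: signal variance sigma_f^2.
    """
    A_scaled = A / lengthscales
    B_scaled = B / lengthscales
    sq_dist = (np.sum(A_scaled**2, axis=1)[:, None]
               + np.sum(B_scaled**2, axis=1)[None, :]
               - 2.0 * A_scaled @ B_scaled.T)
    return sf2 * np.exp(-0.5 * np.clip(sq_dist, 0.0, None))

# Example: 5 three-dimensional inputs; a very large lengthscale on a dimension
# makes the kernel (and hence the GP) nearly insensitive to that dimension.
X = np.random.default_rng(1).standard_normal((5, 3))
K = se_ard_kernel(X, X, lengthscales=np.array([1.0, 0.5, 100.0]), sf2=2.0)
print(K.shape)   # (5, 5)
```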
57 While the predictive means are similar, both our model and MLHGP pinch in the variance around the low noise areas. [sent-156, score-0.584]
58 Our model correctly expands the variance around all steep areas whereas MLHGP can only do so where high noise is observed (see areas around x= -6 and x = 1). [sent-157, score-0.672]
59 This function was chosen as it has areas of steep gradient and near-flat gradient and thus suffers from the heteroscedastic problems we are trying to solve. [sent-160, score-0.366]
60 The standard GP model has to take into account the large noise seen around the steeply sloped areas by assuming large noise everywhere, which leads to the much larger error bars. [sent-162, score-0.799]
61 Our model can recover the actual noise levels by taking the input noise into account. [sent-163, score-0.824]
62 Both our model and MLHGP pinch the variance in around the flat regions of the function and expand it around the steep areas. [sent-164, score-0.335]
63 For the example shown in figure 2 the standard GP estimated an output noise standard deviation of 0. [sent-165, score-0.483]
64 Our model also learnt an input noise standard deviation of 0. [sent-169, score-0.614]
65 MLHGP does not produce a single estimate of noise levels. [sent-172, score-0.293]
66 Part of the reason for our improvement over MLHGP can be seen around x = 1: our model has near-symmetric ‘horns’ in the variance around the corners of the square wave, whereas MLHGP only has one ‘horn’. [sent-178, score-0.292]
67 This is because in our model, the amount of noise expected is proportional to the square of the derivative of the mean, which is the same on both sides of the square wave. [sent-179, score-0.445]
68 In Kersting et al.’s model the noise is estimated from the training points themselves. [sent-181, score-0.425]
69 In this example the training points around x = 1 happen to have low noise and so the learnt variance is smaller. [sent-182, score-0.618]
70 This illustrates an important aspect of our model: the accuracy in plotting the varying effect of noise is only dependent on the accuracy of the mean posterior function and not on an extra, learnt noise model. [sent-184, score-0.86]
71 This means that our model typically requires fewer data points to achieve the same accuracy as MLHGP on input noise datasets. [sent-185, score-0.514]
72 The figure shows that NIGP performs very well on all the functions, always outperforming the standard GP when there is input noise and nearly always MLHGP; wherever there is a significant difference our model is favoured. [sent-195, score-0.514]
73 Training on all the outputs at once only gives an improvement for some of the functions, which suggests that, for the others, the input noise levels could be estimated from the individual functions alone. [sent-196, score-0.572]
74 These results show our model consistently calculates a more accurate predictive posterior variance than either a standard GP or a state-of-the-art heteroscedastic GP model. [sent-199, score-0.6]
75 In this situation the input and output noise variance will be the same. [sent-201, score-0.647]
76 We tested NIGP on a timeseries dataset and compared the two modes (with separate input and output noise hyperparameters and with combined) and also to standard GP regression (MLHGP was not available for multiple input dimensions). [sent-203, score-0.925]
77 All four versions of our model perform better than the standard GP. [sent-211, score-0.331]
80 Figure 3: Comparison of models for a suite of 6 test functions; each panel plots negative log predictive posterior against input noise standard deviation (the functions include sin(x), a near-square wave, and an exponentially decaying sine wave). [sent-412, score-0.429]
81 The plots show our model has lower negative log posterior predictive than the standard GP on all the functions, particularly the exponentially decaying sine wave and the product of tan and sin. [sent-418, score-0.409]
82 There is also a slight improvement using the combined noise levels although, again, the difference is contained within the error bars. [sent-422, score-0.377]
83 A better comparison between the two modes is to look at the input noise variance values recovered. [sent-423, score-0.619]
84 Both modes struggle to recover the correct noise level on the second dimension and this is probably why the angular velocity prediction performance shown in figure 4 is worse than the angle prediction performance. [sent-433, score-0.469]
85 indicate whether the model combined the input and output noise parameters or treated them separately. [sent-438, score-0.558]
86 This significantly improved the recovered noise value, although the difference between the two NIGP modes then shrank, as there was sufficient information to correctly deduce the noise levels separately. [sent-440, score-0.691]
87 6 Conclusion The correct way of training on input points corrupted by Gaussian noise is to consider every input point as a Gaussian distribution. [sent-441, score-0.771]
88 In our model, we refer the input noise to the output by passing it through a local linear expansion. [sent-443, score-0.53]
89 This adds a term to the likelihood which is proportional to the squared posterior mean gradient. [sent-444, score-0.303]
90 Not only does this lead to tractable computations but it makes intuitive sense - input noise has a larger effect in areas where the function is changing its output rapidly. [sent-445, score-0.599]
91 It can make use of multiple outputs and can recover a noise variance parameter for each input dimension, which is often useful for analysis. [sent-448, score-0.605]
92 In our approximate model, exact inference can be performed as the model hyperparameters can be trained simultaneously by marginal likelihood maximisation. [sent-449, score-0.289]
93 A proper handling of time-series data would constrain the noise levels on each training point to be the same whether that point is considered an input or an output. [sent-450, score-0.453]
94 By allowing input noise and fixing the input and output noise variances to be identical, our model is a computationally efficient alternative. [sent-452, score-1.059]
95 It is important to state that this model has been designed to tackle a particular situation, that of constant-variance input noise, and would not perform so well on a general heteroscedastic problem. [sent-454, score-0.392]
96 It could not be expected to improve over a standard GP on problems where noise levels are proportional to the function or input value for example. [sent-455, score-0.577]
97 We do not see this limitation as too restrictive, however, as we maintain that constant input noise situations (including those where this is a sufficient approximation) are reasonably common. [sent-456, score-0.45]
98 We would expect this to be a good approximation for any function provided that the input noise levels are not too large. [sent-460, score-0.503]
99 In practice, we could require that the input noise level is not larger than the input characteristic length scale. [sent-463, score-0.607]
100 Variable noise and dimensionality reduction for sparse Gaussian processes. [sent-491, score-0.342]
wordName wordTfidf (topN-words)
[('gp', 0.499), ('nigp', 0.441), ('mlhgp', 0.294), ('noise', 0.293), ('heteroscedastic', 0.207), ('input', 0.157), ('posterior', 0.15), ('kersting', 0.143), ('variance', 0.117), ('sin', 0.116), ('hyperparameters', 0.11), ('wave', 0.091), ('output', 0.08), ('measurements', 0.073), ('slope', 0.071), ('training', 0.068), ('trained', 0.067), ('corrective', 0.064), ('steep', 0.064), ('goldberg', 0.064), ('dtp', 0.063), ('initialisations', 0.063), ('stp', 0.063), ('learnt', 0.062), ('predictive', 0.062), ('edward', 0.06), ('angular', 0.06), ('pendulum', 0.06), ('corrupted', 0.06), ('levels', 0.053), ('modes', 0.052), ('variances', 0.051), ('marginal', 0.05), ('modelling', 0.05), ('gaussian', 0.049), ('rmse', 0.048), ('derivative', 0.046), ('diag', 0.045), ('normalised', 0.045), ('squared', 0.045), ('maximisation', 0.043), ('areas', 0.043), ('rasmussen', 0.042), ('around', 0.042), ('taylor', 0.042), ('copula', 0.042), ('dallaire', 0.042), ('heteroscedasticity', 0.042), ('lengthscale', 0.042), ('pinch', 0.042), ('procedings', 0.042), ('wilson', 0.042), ('tan', 0.042), ('angle', 0.041), ('regression', 0.04), ('inputs', 0.039), ('process', 0.039), ('hyperparameter', 0.039), ('deviation', 0.038), ('predictions', 0.038), ('outputs', 0.038), ('datasets', 0.038), ('proportional', 0.038), ('carl', 0.038), ('gps', 0.038), ('plot', 0.038), ('snelson', 0.037), ('standard', 0.036), ('deviations', 0.036), ('mean', 0.036), ('deterministic', 0.036), ('points', 0.036), ('extra', 0.035), ('noisy', 0.034), ('likelihood', 0.034), ('alongside', 0.034), ('et', 0.033), ('test', 0.032), ('square', 0.032), ('moments', 0.032), ('christopher', 0.031), ('improvement', 0.031), ('suite', 0.03), ('gure', 0.03), ('twenty', 0.029), ('calculated', 0.028), ('model', 0.028), ('zoubin', 0.027), ('fit', 0.027), ('tanh', 0.027), ('equation', 0.026), ('effect', 0.026), ('gradient', 0.026), ('measurement', 0.026), ('derivatives', 0.025), ('expansion', 0.024), ('xj', 0.024), ('wishart', 0.023), ('velocity', 0.023), ('cos', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
2 0.26524282 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
3 0.20701273 30 nips-2011-Algorithms for Hyper-Parameter Optimization
Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl
Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1
4 0.14302708 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
5 0.12589036 24 nips-2011-Active learning of neural response functions with Gaussian processes
Author: Mijung Park, Greg Horwitz, Jonathan W. Pillow
Abstract: A sizeable literature has focused on the problem of estimating a low-dimensional feature space for a neuron’s stimulus sensitivity. However, comparatively little work has addressed the problem of estimating the nonlinear function from feature space to spike rate. Here, we use a Gaussian process (GP) prior over the infinitedimensional space of nonlinear functions to obtain Bayesian estimates of the “nonlinearity” in the linear-nonlinear-Poisson (LNP) encoding model. This approach offers increased flexibility, robustness, and computational tractability compared to traditional methods (e.g., parametric forms, histograms, cubic splines). We then develop a framework for optimal experimental design under the GP-Poisson model using uncertainty sampling. This involves adaptively selecting stimuli according to an information-theoretic criterion, with the goal of characterizing the nonlinearity with as little experimental data as possible. Our framework relies on a method for rapidly updating hyperparameters under a Gaussian approximation to the posterior. We apply these methods to neural data from a color-tuned simple cell in macaque V1, characterizing its nonlinear response function in the 3D space of cone contrasts. We find that it combines cone inputs in a highly nonlinear manner. With simulated experiments, we show that optimal design substantially reduces the amount of data required to estimate these nonlinear combination rules. 1
6 0.11670841 101 nips-2011-Gaussian process modulated renewal processes
7 0.11535129 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
8 0.11037163 301 nips-2011-Variational Gaussian Process Dynamical Systems
9 0.10260055 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
10 0.096071519 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
11 0.091419727 131 nips-2011-Inference in continuous-time change-point models
12 0.083969377 217 nips-2011-Practical Variational Inference for Neural Networks
13 0.082029477 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
14 0.080998778 258 nips-2011-Sparse Bayesian Multi-Task Learning
15 0.077647246 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
16 0.068874255 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron
17 0.062960483 153 nips-2011-Learning large-margin halfspaces with more malicious noise
18 0.062174212 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
19 0.05587586 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
20 0.055011775 186 nips-2011-Noise Thresholds for Spectral Clustering
topicId topicWeight
[(0, 0.175), (1, 0.02), (2, 0.095), (3, -0.034), (4, -0.051), (5, -0.056), (6, 0.066), (7, -0.03), (8, 0.08), (9, 0.154), (10, -0.175), (11, -0.073), (12, 0.045), (13, 0.012), (14, -0.047), (15, 0.202), (16, 0.04), (17, 0.124), (18, -0.187), (19, -0.18), (20, 0.006), (21, 0.128), (22, -0.098), (23, -0.185), (24, 0.051), (25, 0.081), (26, 0.03), (27, 0.05), (28, 0.01), (29, -0.199), (30, -0.007), (31, 0.038), (32, 0.188), (33, 0.046), (34, 0.048), (35, -0.053), (36, 0.023), (37, 0.035), (38, -0.05), (39, 0.04), (40, -0.009), (41, 0.017), (42, 0.029), (43, -0.046), (44, -0.018), (45, -0.027), (46, -0.012), (47, 0.06), (48, -0.047), (49, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.95230508 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
2 0.83431679 26 nips-2011-Additive Gaussian Processes
Author: David K. Duvenaud, Hannes Nickisch, Carl E. Rasmussen
Abstract: We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models, and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks. 1
3 0.81592065 30 nips-2011-Algorithms for Hyper-Parameter Optimization
Author: James S. Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl
Abstract: Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results. We present hyper-parameter optimization results on tasks of training neural networks and deep belief networks (DBNs). We optimize hyper-parameters using random search and two new greedy sequential methods based on the expected improvement criterion. Random search has been shown to be sufficiently efficient for learning neural networks for several datasets, but we show it is unreliable for training DBNs. The sequential algorithms are applied to the most difficult DBN learning problems from [1] and find significantly better results than the best previously reported. This work contributes novel techniques for making response surface models P (y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements. 1
4 0.60063517 101 nips-2011-Gaussian process modulated renewal processes
Author: Yee W. Teh, Vinayak Rao
Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1
5 0.55248922 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
6 0.50592601 24 nips-2011-Active learning of neural response functions with Gaussian processes
7 0.48583493 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
8 0.43785298 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
9 0.43526772 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
10 0.43335015 131 nips-2011-Inference in continuous-time change-point models
11 0.43299928 301 nips-2011-Variational Gaussian Process Dynamical Systems
12 0.42946306 225 nips-2011-Probabilistic amplitude and frequency demodulation
13 0.40883103 139 nips-2011-Kernel Bayes' Rule
14 0.40715083 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
15 0.40430716 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron
16 0.38628548 240 nips-2011-Robust Multi-Class Gaussian Process Classification
17 0.360946 217 nips-2011-Practical Variational Inference for Neural Networks
18 0.33651468 258 nips-2011-Sparse Bayesian Multi-Task Learning
19 0.33522299 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
20 0.32048163 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
topicId topicWeight
[(0, 0.01), (4, 0.019), (20, 0.028), (26, 0.015), (31, 0.112), (33, 0.015), (43, 0.091), (45, 0.08), (57, 0.444), (74, 0.039), (83, 0.024), (99, 0.031)]
simIndex simValue paperId paperTitle
1 0.96657413 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
Author: Emin Orhan, Robert A. Jacobs
Abstract: Extensive evidence suggests that items are not encoded independently in visual short-term memory (VSTM). However, previous research has not quantitatively considered how the encoding of an item influences the encoding of other items. Here, we model the dependencies among VSTM representations using a multivariate Gaussian distribution with a stimulus-dependent mean and covariance matrix. We report the results of an experiment designed to determine the specific form of the stimulus-dependence of the mean and the covariance matrix. We find that the magnitude of the covariance between the representations of two items is a monotonically decreasing function of the difference between the items’ feature values, similar to a Gaussian process with a distance-dependent, stationary kernel function. We further show that this type of covariance function can be explained as a natural consequence of encoding multiple stimuli in a population of neurons with correlated responses. 1
2 0.95743638 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations
Author: Flavio Chierichetti, David Liben-nowell, Jon M. Kleinberg
Abstract: Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T ? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet. 1
3 0.92348289 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
Author: Jiarong Jiang, Piyush Rai, Hal Daume
Abstract: We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
same-paper 4 0.91424119 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
5 0.88373184 111 nips-2011-Hashing Algorithms for Large-Scale Learning
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
6 0.72629744 171 nips-2011-Metric Learning with Multiple Kernels
7 0.6576848 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
8 0.65159464 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
9 0.63833266 157 nips-2011-Learning to Search Efficiently in High Dimensions
10 0.63618898 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron
11 0.61336815 226 nips-2011-Projection onto A Nonnegative Max-Heap
12 0.60547823 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
13 0.60047746 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
14 0.59841573 177 nips-2011-Multi-armed bandits on implicit metric spaces
15 0.59836233 219 nips-2011-Predicting response time and error rates in visual search
16 0.59679639 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
17 0.58847439 24 nips-2011-Active learning of neural response functions with Gaussian processes
18 0.58562768 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
19 0.58322191 30 nips-2011-Algorithms for Hyper-Parameter Optimization
20 0.56786931 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound