nips nips2003 nips2003-194 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Edward Snelson, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. [sent-6, score-0.201]
2 The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. [sent-8, score-0.128]
3 This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. [sent-9, score-0.13]
4 We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. [sent-10, score-0.195]
5 Once this is done, GPs can be used as the basis for nonlinear nonparametric regression and classification, showing excellent performance on a wide variety of datasets [1, 2, 3]. [sent-12, score-0.173]
6 This simplicity enables predictions to be made easily using matrix manipulations, and of course the predictive distributions are Gaussian also. [sent-15, score-0.177]
7 Often it is unreasonable to assume that, in the form in which the data is obtained, the noise will be Gaussian and the data well modelled by a GP. [sent-16, score-0.109]
8 A common practice is to first transform the data, for example by taking its log; modelling then proceeds assuming that this transformed data has Gaussian noise and will be better modelled by the GP. [sent-19, score-0.138]
9 The log is just one particular transformation that could be done; there is a con- tinuum of transformations that could be applied to the observation space to bring the data into a form well modelled by a GP. [sent-20, score-0.317]
10 Making such a transformation should really be a full part of the probabilistic modelling; it seems strange to first make an ad-hoc transformation, and then use a principled Bayesian probabilistic model. [sent-21, score-0.101]
11 In this paper we show how such a transformation or ‘warping’ of the observation space can be made entirely automatically, fully encompassed into the probabilistic framework of the GP. [sent-22, score-0.174]
12 The warped GP makes a transformation from a latent space to the observation, such that the data is best modelled by a GP in the latent space. [sent-23, score-0.731]
13 It can also be viewed as a generalisation of the GP, since in observation space it is a non-Gaussian process, with non-Gaussian and asymmetric noise in general. [sent-24, score-0.134]
14 For an excellent review of Gaussian processes for regression and classification see [4]. [sent-26, score-0.109]
15 We show in sections 4 and 5, with both toy and real data, that the warped GP can significantly improve predictive performance over a variety of measures, especially with regard to the whole predictive distribution, rather than just a single point prediction such as the mean or median. [sent-28, score-0.679]
16 2 Nonlinear regression with Gaussian processes Suppose we are given a dataset D, consisting of N pairs of input vectors XN ≡ {x^(n)}_{n=1}^{N} and real-valued targets tN ≡ {tn}_{n=1}^{N}. [sent-30, score-0.18]
17 The covariance between the function values of y at two points x and x′ is modelled with a covariance function C(x, x′), which is usually assumed to have some simple parametric form. [sent-34, score-0.251]
18 Often the noise model is taken to be input-independent, and the covariance function is taken to be a Gaussian function of the difference in the input vectors (a stationary covariance function), although many other possibilities exist, see e. [sent-36, score-0.215]
19 In this paper we consider only this popular choice, in which case the entries in the covariance matrix are given by
Cmn = v1 exp[ −(1/2) Σ_{d=1}^{D} ( (x_d^(m) − x_d^(n)) / r_d )^2 ] + v0 δmn .   (2)   [sent-39, score-0.106]
20 Here rd is a width parameter expressing the scale over which typical functions vary in the dth dimension, v1 is a size parameter expressing the typical size of the overall process in y-space, v0 is the noise variance of the observations, and Θ = {v0, v1, r1, . . . , rD} are the hyperparameters. [sent-40, score-0.164]
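As a rough illustration of equation (2), a covariance matrix of this form could be computed as below; the function name and argument layout are our own, not from the paper.

```python
import numpy as np

def covariance_matrix(X1, X2, v0, v1, r, add_noise=False):
    """Squared-exponential covariance (2):
    Cmn = v1 * exp(-0.5 * sum_d ((x_d^(m) - x_d^(n)) / r_d)^2) + v0 * delta_mn.
    X1: (M, D), X2: (N, D), r: length-D array of width parameters.
    The v0 * delta_mn noise term is only added when requested (i.e. for C_N itself).
    """
    diff = (X1[:, None, :] - X2[None, :, :]) / r        # (M, N, D) scaled differences
    C = v1 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))  # (M, N) signal covariance
    if add_noise:
        C = C + v0 * np.eye(X1.shape[0])                # observation noise on the diagonal
    return C
```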
21 It is simple to show that the predictive distribution for a new point given the observed data, P (tN +1 |tN , XN +1 ), is Gaussian. [sent-44, score-0.109]
22 The calculation of the mean and variance of this distribution involves inverting the covariance matrix CN of the training inputs, which using standard exact methods incurs a computational cost of order N^3. [sent-45, score-0.146]
23 Learning, or ‘training’, in a GP is usually achieved by finding a local maximum in the likelihood using conjugate gradient methods with respect to the hyperparameters Θ of the covariance matrix. [sent-46, score-0.148]
24 The negative log likelihood is given by
L = − log P(tN | XN, Θ) = (1/2) log det CN + (1/2) tN^T CN^{−1} tN + (N/2) log 2π .   (3)   [sent-47, score-0.169]
25 Once again, the evaluation of L, and its gradients with respect to Θ, involve computing the inverse covariance matrix, incurring an order N^3 cost. [sent-48, score-0.285]
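A hedged sketch of how equation (3) could be evaluated, reusing the covariance_matrix sketch above and a Cholesky factorisation in place of an explicit inverse (still order N^3, but numerically more stable).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_negative_log_likelihood(X, t, v0, v1, r):
    """Equation (3): L = 0.5*log det C_N + 0.5*t' C_N^{-1} t + 0.5*N*log(2*pi)."""
    N = X.shape[0]
    C = covariance_matrix(X, X, v0, v1, r, add_noise=True)
    chol, lower = cho_factor(C)                      # O(N^3) factorisation of C_N
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))    # log det C_N from the Cholesky diagonal
    alpha = cho_solve((chol, lower), t)              # C_N^{-1} t without forming the inverse
    return 0.5 * log_det + 0.5 * t @ alpha + 0.5 * N * np.log(2.0 * np.pi)
```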
26 3 Warping the observation space In this section we present a method of warping the observation space through a nonlinear monotonic function to a latent space, whilst retaining the full probabilistic framework to enable learning and prediction to take place consistently. [sent-50, score-0.618]
27 Let us consider a vector of latent targets zN and suppose that this vector is modelled by a GP,
− log P(zN | XN, Θ) = (1/2) log det CN + (1/2) zN^T CN^{−1} zN + (N/2) log 2π .   (4)   [sent-51, score-0.385]
28 Now we make a transformation from the true observation space to the latent space by mapping each observation through the same monotonic function f,
zn = f(tn ; Ψ)   ∀n ,   (5)
where Ψ parameterises the transformation. [sent-52, score-0.427]
29 We require f to be monotonic and to map onto the whole of the real line; otherwise probability measure will not be conserved in the transformation, and we will not induce a valid distribution over the targets tN. [sent-53, score-0.141]
30 Including the Jacobian term that takes the transformation into account, the negative log likelihood, − log P(tN | XN, Θ, Ψ), now becomes: [sent-54, score-0.231]
31 L = (1/2) log det CN + (1/2) f(tN)^T CN^{−1} f(tN) − Σ_{n=1}^{N} log ∂f(t)/∂t |_{tn} + (N/2) log 2π .   (6)   [sent-55, score-0.452]
32 3.1 Training the warped GP: Learning in this extended model is achieved by simply taking derivatives of the negative log likelihood function (6) with respect to both the Θ and Ψ parameter vectors, and using a conjugate gradient method to compute ML parameter values. [sent-56, score-0.587]
33 In this way the form of both the covariance matrix and the nonlinear transformation is learnt simultaneously under the same probabilistic framework. [sent-57, score-0.295]
34 Since the computational limiter to a GP is inverting the covariance matrix, adding a few extra parameters into the likelihood is not really costing us anything. [sent-58, score-0.163]
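A minimal sketch of equation (6) and the joint ML training it implies; the warp and its derivative are passed in as callables, it reuses the gp_negative_log_likelihood sketch above, and the optimiser call is indicated only schematically since the parameter packing is our own assumption.

```python
import numpy as np

def warped_gp_nll(X, t, v0, v1, r, warp, dwarp_dt):
    """Equation (6): GP negative log likelihood of the warped targets z = f(t),
    minus the sum of log-Jacobian terms log f'(t_n) over the training points."""
    z = warp(t)
    return gp_negative_log_likelihood(X, z, v0, v1, r) - np.sum(np.log(dwarp_dt(t)))

# Joint training sketch: pack Theta (v0, v1, r) and Psi (warp parameters) into one
# vector and minimise warped_gp_nll with a conjugate-gradient style optimiser, e.g.
# scipy.optimize.minimize(objective, p0, method="CG"); the paper uses analytic
# gradients, whereas a sketch like this could fall back on numerical ones.
```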
35 The standard GP predictive distribution for a new point is Gaussian in the latent space, with mean ẑN+1 and variance σ²N+1 (7); to find the distribution in the observation space we pass that Gaussian through the nonlinear warping function, giving
P(tN+1 | x^(N+1), D, Θ, Ψ) = [ f′(tN+1) / √(2π σ²N+1) ] exp[ −(1/2) ( (f(tN+1) − ẑN+1) / σN+1 )² ] .   (8)   [sent-63, score-0.464]
36 The shape of this distribution depends on the form of the warping function f, but in general it may be asymmetric and multimodal. [sent-64, score-0.37]
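A sketch of equation (8); it assumes the latent predictive mean z_hat and variance sigma2 have already been obtained from the standard GP predictive equations (7), which are not reproduced here.

```python
import numpy as np

def predictive_density(t_grid, z_hat, sigma2, warp, dwarp_dt):
    """Equation (8): push the latent Gaussian N(z_hat, sigma2) through the warp f.
    Returns p(t | x, D) on a grid of candidate observations; in general the result
    is asymmetric and possibly multimodal, depending on the shape of f."""
    z = warp(t_grid)
    gauss = np.exp(-0.5 * (z - z_hat) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
    return dwarp_dt(t_grid) * gauss   # Jacobian factor f'(t) times the latent Gaussian
```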
37 If our loss function is absolute error, then the median of the distribution should be predicted, whereas if our loss function is squared error, then it is the mean of the distribution. [sent-66, score-0.143]
38 For a standard GP where the predictive distribution is Gaussian, the median and mean lie at the same point. [sent-67, score-0.21]
39 For the warped GP in general they are at different points. [sent-68, score-0.435]
40 The median is particularly easy to calculate:
t^med_{N+1} = f^{−1}(ẑN+1) .   (9)   [sent-69, score-0.112]
41 Notice we need to compute the inverse warping function. [sent-70, score-0.36]
42 For example we may want to know the positions of ‘2σ’ either side of the median so that we can say that approximately 95% of the density lies between these bounds. [sent-74, score-0.115]
43 These points in observation space are calculated in exactly the same way as the median: simply pass the values through the inverse function, t^{med±2σ}_{N+1} = f^{−1}(ẑN+1 ± 2σN+1) .   (10)   [sent-75, score-0.195]
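A sketch of equations (9) and (10); the monotonic warp is inverted numerically here with a simple bracketing root finder, which is one possible choice rather than necessarily the paper's own scheme, and the bracket limits are placeholders.

```python
from scipy.optimize import brentq

def inverse_warp(z_target, warp, t_lo=-1e6, t_hi=1e6):
    """Invert the monotonic warp numerically: find t such that f(t) = z_target."""
    return brentq(lambda t: warp(t) - z_target, t_lo, t_hi)

def median_and_bounds(z_hat, sigma, warp):
    """Equations (9) and (10): median and approximate 95% bounds in observation space."""
    return (inverse_warp(z_hat, warp),
            inverse_warp(z_hat - 2.0 * sigma, warp),
            inverse_warp(z_hat + 2.0 * sigma, warp))
```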
44 The mean of the predictive distribution is an expectation over observation space; rewriting this integral back in latent space we get ∫ dz f^{−1}(z) N_z(ẑN+1, σ²N+1) = E[f^{−1}] .   (11)   [sent-77, score-0.105]
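The expectation in equation (11) has no closed form for a general warp; one standard way to approximate it (our choice here, not necessarily the paper's) is Gauss-Hermite quadrature, using the numerical inverse warp from the sketch above.

```python
import numpy as np

def predictive_mean(z_hat, sigma, inverse_warp_fn, degree=20):
    """Equation (11): E[t] = integral of f^{-1}(z) N_z(z_hat, sigma^2) dz,
    approximated by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(degree)   # nodes/weights for weight exp(-x^2)
    z_nodes = z_hat + np.sqrt(2.0) * sigma * x       # change of variables to N(z_hat, sigma^2)
    values = np.array([inverse_warp_fn(z) for z in z_nodes])
    return values @ w / np.sqrt(np.pi)
```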
45 3.3 Choosing a monotonic warping function: We wish to design a warping function that will allow for complex transformations, but we must constrain the function to be monotonic. [sent-80, score-0.706]
46 There are various ways to do this, an obvious one being a neural-net style sum of tanh functions,
f(t; Ψ) = Σ_{i=1}^{I} a_i tanh( b_i (t + c_i) ) ,   a_i, b_i ≥ 0 ∀i ,   (12)
where Ψ = {a, b, c}. [sent-81, score-0.202]
47 The dotted lines show the true generating distribution, the dashed lines show a GP’s predictions, and the solid lines show the warped GP’s predictions. [sent-92, score-0.522]
48 (a) The triplets of lines represent the median, and 2σ percentiles in each case. [sent-93, score-0.104]
49 The derivatives of this function with respect to either t, or the warping parameters Ψ, are easy to compute. [sent-97, score-0.358]
50 As explained earlier, such a bounded sum of tanh functions will not lead to a proper density in t space, because the density in z space is Gaussian, which covers the whole of the real line. [sent-100, score-0.139]
51 We can fix this up by using instead:
f(t; Ψ) = t + Σ_{i=1}^{I} a_i tanh( b_i (t + c_i) ) ,   a_i, b_i ≥ 0 ∀i .   (13)   [sent-101, score-0.106]
52 In doing so, we have restricted ourselves to only making warping functions with gradient f′ ≥ 1, but because the size parameter v1 of the covariance function is free to vary, the effective gradient can be made arbitrarily small by simply making the range of the data in the latent space arbitrarily big. [sent-103, score-0.571]
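A sketch of the warping function (13) and its derivative, with the positivity constraints on a_i and b_i assumed to be handled by the caller (for example by optimising their logs); the derivative is needed for the likelihood (6) and the density (8), and it is at least 1 by construction.

```python
import numpy as np

def warp_tanh(t, a, b, c):
    """Equation (13): f(t) = t + sum_i a_i * tanh(b_i * (t + c_i)), with a_i, b_i >= 0."""
    t = np.asarray(t, dtype=float)
    return t + np.sum(a * np.tanh(b * (t[..., None] + c)), axis=-1)

def warp_tanh_deriv(t, a, b, c):
    """df/dt = 1 + sum_i a_i * b_i / cosh(b_i * (t + c_i))^2, which is >= 1."""
    t = np.asarray(t, dtype=float)
    return 1.0 + np.sum(a * b / np.cosh(b * (t[..., None] + c)) ** 2, axis=-1)
```

These two callables could then serve as the warp and dwarp_dt arguments in the earlier sketches.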
53 A more flexible system of linear trends may be made by including, in addition to the neural-net style function (12), some functions of the form (1/β) log( e^{β m1 (t−d)} + e^{β m2 (t−d)} ), where m1, m2 ≥ 0. [sent-104, score-0.151]
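A sketch of one such trend function, written with logaddexp for numerical stability; the parameter names follow the text and the call signature is our own.

```python
import numpy as np

def trend_term(t, beta, m1, m2, d):
    """(1/beta) * log(exp(beta*m1*(t - d)) + exp(beta*m2*(t - d))), with m1, m2 >= 0:
    a smooth switch between slopes m1 and m2 around t = d."""
    t = np.asarray(t, dtype=float)
    return np.logaddexp(beta * m1 * (t - d), beta * m2 * (t - d)) / beta
```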
54 4 A simple 1D regression task A simple 1D regression task was created to show a situation where the warped GP should, and does, perform significantly better than the standard GP. [sent-107, score-0.581]
55 101 points, regularly spaced from −π to π on the x axis, were generated with Gaussian noise about a sine function. [sent-108, score-0.097]
56 These points were then warped through the function t = z^{1/3}, to arrive at the dataset t which is shown as the dots in Figure 1(a). [sent-109, score-0.488]
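A hedged reconstruction of the toy set described here; the noise standard deviation and the random seed are assumptions, since the text does not give them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 101)              # 101 regularly spaced inputs on [-pi, pi]
z = np.sin(x) + 0.1 * rng.standard_normal(101)   # latent targets: sine plus Gaussian noise (0.1 assumed)
t = np.cbrt(z)                                   # observed targets: t = z^(1/3) (sign-preserving cube root)
```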
57 Figure 2: Warping functions learnt for the four regression tasks carried out in this paper; panels (a) sine, (b) creep, (c) abalone, (d) ailerons, each plotting the latent value z against the observed target t. [sent-110, score-0.553]
58 Each plot is made over the range of the observation data, from tmin to tmax . [sent-111, score-0.182]
59 A GP and a warped GP were trained independently on this dataset using a conjugate gradient minimisation procedure and randomly initialised parameters, to obtain maximum likelihood parameters. [sent-112, score-0.511]
60 For the warped GP, the warping function (13) was used with just two tanh functions. [sent-113, score-0.842]
61 For both models the covariance matrix (2) was used. [sent-114, score-0.106]
62 Hybrid Monte Carlo was also implemented to integrate over all the parameters, or just the warping parameters (much faster since no matrix inversion is required with each step), but with this dataset (and the real datasets of section 5) no significant differences were found from ML. [sent-115, score-0.46]
63 Predictions from the GP and warped GP were made, using the ML parameters, for 401 points regularly spaced over the range of x. [sent-116, score-0.498]
64 The predictions made were the median and 2σ percentiles in each case, and these are plotted as triplets of lines on Figure 1(a). [sent-117, score-0.247]
65 The predictions from the warped GP are found to be much closer to the true generating distribution than the standard GP, especially with regard to the 2σ lines. [sent-118, score-0.494]
66 The mean line was also computed, and found to lie close to, but slightly skewed from, the median line. [sent-119, score-0.101]
67 Figure 1(b) emphasises the point that the warped GP finds the shape of the whole predictive distribution much better, not just the median or mean. [sent-120, score-0.649]
68 In this plot, one particular point on the x axis is chosen, x = −π/4, and the predictive densities from the GP and warped GP are plotted alongside the true density (which can be written down analytically). [sent-121, score-0.584]
69 Figure 2(a) shows the warping function learnt for this regression task. [sent-123, score-0.467]
70 The tanh functions have adjusted themselves so that they mimic a t^3 nonlinearity over the range of the observation space, thus inverting the z^{1/3} transformation imposed when generating the data. [sent-124, score-0.265]
71 5 Results for some real datasets It is not surprising that the method works well on the toy dataset of section 4 since it was generated from a known nonlinear warping of a smooth function with Gaussian noise. [sent-125, score-0.501]
72 To demonstrate that nonlinear transformations also help on real data sets we have run the warped GP comparing its predictions to an ordinary GP on three regression problems. [sent-126, score-0.651]
73 These datasets are summarised in the following table which shows the range of the targets (tmin , tmax ), the number of input dimensions (D), and the size of the training and test sets (Ntrain , Ntest ) that we used. [sent-127, score-0.158]
74 Dataset summary (columns D, tmin, Ntrain, Ntest; the tmax column is missing here): creep: D = 30, tmin = 18 MPa, Ntrain = 800, Ntest = 1266; abalone: D = 8, tmin = 1 yr, Ntrain = 1000, Ntest = 3177; ailerons: D = 40, tmin = −3.… × 10−4, Ntrain = 1000, Ntest = 6154. [sent-128, score-0.417]
75 Table 1 (numeric entries missing here) compares, per dataset, the GP, the GP with a fixed log transform (creep and abalone only), and the warped GP, on absolute error, squared error, and negative log predictive density.
76 45 Table 1: Results of testing the GP, warped GP, and GP with log transform, on three real datasets. [sent-151, score-0.518]
77 The dataset creep is a materials science set, with the objective to predict creep rupture stress (in MPa) for steel given chemical composition and other inputs [7, 8]. [sent-153, score-0.452]
78 With abalone the aim is to predict the age of abalone from various physical inputs [9]. [sent-154, score-0.262]
79 ailerons is a simulated control problem, with the aim to predict the control action on the ailerons of an F16 aircraft [10, 11]. [sent-155, score-0.228]
80 For datasets creep and abalone, which consist of positive observations only, standard practice may be to model the log of the data with a GP. [sent-156, score-0.297]
81 So for these datasets we have compared three models: a GP directly on the data, a GP on the fixed log-transformed data, and the warped GP directly on the data. [sent-157, score-0.49]
82 The predictive points and densities were always compared in the original data space, accounting for the Jacobian of both the log and the warped transforms. [sent-158, score-0.634]
83 The models were run as in the 1D task: ML parameter estimates only, covariance matrix (2), and warping function (13) with three tanh functions. [sent-159, score-0.513]
84 We show three measures of performance over independent test sets: mean absolute error, mean squared error, and the mean negative log predictive density evaluated at the test points. [sent-161, score-0.256]
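A sketch of the three test measures; it takes as inputs the predicted medians (for absolute error), predicted means (for squared error), and the predictive density evaluated at the true test targets, all computed in the original observation space as described above.

```python
import numpy as np

def test_measures(t_true, t_median, t_mean, density_at_true):
    """Mean absolute error, mean squared error, and mean negative log predictive density."""
    mae = np.mean(np.abs(t_true - t_median))
    mse = np.mean((t_true - t_mean) ** 2)
    nlpd = -np.mean(np.log(density_at_true))
    return mae, mse, nlpd
```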
85 On these three sets, the warped GP always performs significantly better than the standard GP. [sent-163, score-0.435]
86 For creep and abalone, the fixed log transform clearly works well too, but particularly in the case of creep, the warped GP learns a better transformation. [sent-164, score-0.65]
87 Figure 2 shows the warping functions learnt, and indeed 2(b) and 2(c) are clearly log-like in character. [sent-165, score-0.356]
88 On the other hand 2(d), for the ailerons set, is exponential-like. [sent-166, score-0.1]
89 This shows the warped GP is able to flexibly handle these different types of datasets. [sent-167, score-0.435]
90 The shapes of the learnt warping functions were also found to be very robust to random initialisation of the parameters. [sent-168, score-0.417]
91 Finally, the warped GP also makes a better job of predicting the distributions, as shown by the difference in values of the negative log density. [sent-169, score-0.518]
92 6 Conclusions, extensions, and related work We have shown that the warped GP is a useful extension to the standard GP for regression, capable of finding extra structure in the data through the transformations it learns. [sent-170, score-0.495]
93 Of course some datasets are well modelled by a GP already, and applying the warped GP model simply results in a linear “warping” function. [sent-173, score-0.552]
94 Censored datasets, in which many observations at the edge of the range lie on a single point, cause the warped GP problems.
95 The warping function attempts to model the censoring by pushing those points far away from the rest of the data, and it suffers in performance especially for ML learning. [sent-177, score-0.354]
96 As a further extension, one might consider warping the input space in some nonlinear fashion. [sent-179, score-0.401]
97 In the context of geostatistics this has actually been dealt with by O’Hagan [12], where a transformation is made from an input space which can have non-stationary and non-isotropic covariance structure, to a latent space in which the usual conditions of stationarity and isotropy hold. [sent-180, score-0.3]
98 Gaussian process classifiers can also be thought of as warping the outputs of a GP, through a mapping onto the (0, 1) probability interval. [sent-181, score-0.333]
99 However, the observations in classification are discrete, not points in this warped continuous space. [sent-182, score-0.483]
100 Many thanks to David MacKay for useful discussions, suggestions of warping functions and datasets to try. [sent-194, score-0.411]
wordName wordTfidf (topN-words)
[('gp', 0.653), ('warped', 0.435), ('warping', 0.333), ('tn', 0.232), ('creep', 0.15), ('abalone', 0.117), ('ailerons', 0.1), ('predictive', 0.091), ('covariance', 0.084), ('transformation', 0.083), ('median', 0.079), ('zn', 0.075), ('tanh', 0.074), ('regression', 0.073), ('log', 0.065), ('latent', 0.064), ('modelled', 0.062), ('learnt', 0.061), ('ml', 0.057), ('datasets', 0.055), ('gaussian', 0.055), ('mpa', 0.05), ('percentiles', 0.05), ('tmin', 0.05), ('noise', 0.047), ('observation', 0.045), ('nonlinear', 0.045), ('tmax', 0.043), ('predictions', 0.041), ('monotonic', 0.04), ('targets', 0.039), ('transformations', 0.039), ('density', 0.036), ('processes', 0.036), ('cn', 0.035), ('xn', 0.034), ('diggle', 0.033), ('hagan', 0.033), ('ntrain', 0.033), ('rupture', 0.033), ('steel', 0.033), ('tmed', 0.033), ('gps', 0.033), ('bi', 0.032), ('dataset', 0.032), ('mn', 0.032), ('hybrid', 0.03), ('modelling', 0.029), ('lines', 0.029), ('cmn', 0.029), ('sine', 0.029), ('parameterises', 0.029), ('ntest', 0.029), ('predict', 0.028), ('inverse', 0.027), ('observations', 0.027), ('materials', 0.026), ('edward', 0.026), ('monte', 0.026), ('carlo', 0.026), ('whole', 0.026), ('absolute', 0.025), ('det', 0.025), ('derivatives', 0.025), ('triplets', 0.025), ('jacobian', 0.025), ('rd', 0.024), ('space', 0.023), ('functions', 0.023), ('expressing', 0.023), ('made', 0.023), ('conjugate', 0.023), ('densities', 0.022), ('carl', 0.022), ('style', 0.022), ('lie', 0.022), ('matrix', 0.022), ('regular', 0.021), ('phd', 0.021), ('squared', 0.021), ('regularly', 0.021), ('christopher', 0.021), ('likelihood', 0.021), ('extra', 0.021), ('points', 0.021), ('range', 0.021), ('hyperparameters', 0.02), ('gradients', 0.019), ('asymmetric', 0.019), ('inverting', 0.019), ('bayesian', 0.019), ('map', 0.019), ('distribution', 0.018), ('integral', 0.018), ('thesis', 0.018), ('real', 0.018), ('trends', 0.018), ('really', 0.018), ('negative', 0.018), ('toy', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 194 nips-2003-Warped Gaussian Processes
Author: Edward Snelson, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. 1
2 0.39119184 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
Author: Christopher J. Paciorek, Mark J. Schervish
Abstract: We introduce a class of nonstationary covariance functions for Gaussian process (GP) regression. Nonstationary covariance functions allow the model to adapt to functions whose smoothness varies with the inputs. The class includes a nonstationary version of the Matérn stationary covariance, in which the differentiability of the regression function is controlled by a parameter, freeing one from fixing the differentiability in advance. In experiments, the nonstationary GP regression model performs well when the input space is two or three dimensions, outperforming a neural network model and Bayesian free-knot spline models, and competitive with a Bayesian neural network, but is outperformed in one dimension by a state-of-the-art Bayesian free-knot spline model. The model readily generalizes to non-Gaussian data. Use of computational methods for speeding GP fitting may allow for implementation of the method on larger datasets. 1
3 0.3603428 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
4 0.11667972 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
Author: Anton Schwaighofer, Marian Grigoras, Volker Tresp, Clemens Hoffmann
Abstract: In this article, we present a novel approach to solving the localization problem in cellular networks. The goal is to estimate a mobile user’s position, based on measurements of the signal strengths received from network base stations. Our solution works by building Gaussian process models for the distribution of signal strengths, as obtained in a series of calibration measurements. In the localization stage, the user’s position can be estimated by maximizing the likelihood of received signal strengths with respect to the position. We investigate the accuracy of the proposed approach on data obtained within a large indoor cellular network. 1
5 0.089494355 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
6 0.083074354 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
7 0.065714382 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
8 0.065608412 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
9 0.060816042 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
10 0.058026813 115 nips-2003-Linear Dependent Dimensionality Reduction
11 0.047768287 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
12 0.041863229 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
13 0.040590353 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
14 0.038022738 170 nips-2003-Self-calibrating Probability Forecasting
15 0.037514068 126 nips-2003-Measure Based Regularization
16 0.037320971 155 nips-2003-Perspectives on Sparse Bayesian Learning
17 0.037121281 176 nips-2003-Sequential Bayesian Kernel Regression
18 0.03614391 11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors
19 0.034162298 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
20 0.033930581 107 nips-2003-Learning Spectral Clustering
topicId topicWeight
[(0, -0.161), (1, 0.037), (2, -0.021), (3, 0.004), (4, -0.007), (5, 0.208), (6, 0.086), (7, -0.249), (8, 0.21), (9, 0.203), (10, -0.315), (11, 0.315), (12, 0.01), (13, -0.217), (14, 0.026), (15, -0.155), (16, -0.16), (17, 0.09), (18, -0.073), (19, -0.047), (20, -0.025), (21, -0.128), (22, -0.079), (23, -0.042), (24, 0.106), (25, 0.051), (26, -0.055), (27, 0.009), (28, 0.037), (29, 0.105), (30, 0.068), (31, 0.036), (32, 0.047), (33, -0.061), (34, 0.025), (35, -0.102), (36, 0.008), (37, -0.042), (38, -0.028), (39, 0.03), (40, -0.015), (41, 0.083), (42, 0.003), (43, -0.043), (44, 0.02), (45, -0.011), (46, 0.029), (47, -0.057), (48, 0.031), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.96020085 194 nips-2003-Warped Gaussian Processes
Author: Edward Snelson, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. 1
2 0.86105496 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
Author: Christopher J. Paciorek, Mark J. Schervish
Abstract: We introduce a class of nonstationary covariance functions for Gaussian process (GP) regression. Nonstationary covariance functions allow the model to adapt to functions whose smoothness varies with the inputs. The class includes a nonstationary version of the Matérn stationary covariance, in which the differentiability of the regression function is controlled by a parameter, freeing one from fixing the differentiability in advance. In experiments, the nonstationary GP regression model performs well when the input space is two or three dimensions, outperforming a neural network model and Bayesian free-knot spline models, and competitive with a Bayesian neural network, but is outperformed in one dimension by a state-of-the-art Bayesian free-knot spline model. The model readily generalizes to non-Gaussian data. Use of computational methods for speeding GP fitting may allow for implementation of the method on larger datasets. 1
3 0.62227559 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
4 0.60725665 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
Author: Anton Schwaighofer, Marian Grigoras, Volker Tresp, Clemens Hoffmann
Abstract: In this article, we present a novel approach to solving the localization problem in cellular networks. The goal is to estimate a mobile user’s position, based on measurements of the signal strengths received from network base stations. Our solution works by building Gaussian process models for the distribution of signal strengths, as obtained in a series of calibration measurements. In the localization stage, the user’s position can be estimated by maximizing the likelihood of received signal strengths with respect to the position. We investigate the accuracy of the proposed approach on data obtained within a large indoor cellular network. 1
5 0.2513763 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
Author: Jan Eichhorn, Andreas Tolias, Alexander Zien, Malte Kuss, Jason Weston, Nikos Logothetis, Bernhard Schölkopf, Carl E. Rasmussen
Abstract: We report and compare the performance of different learning algorithms based on data from cortical recordings. The task is to predict the orientation of visual stimuli from the activity of a population of simultaneously recorded neurons. We compare several ways of improving the coding of the input (i.e., the spike data) as well as of the output (i.e., the orientation), and report the results obtained using different kernel algorithms. 1
6 0.22088781 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
7 0.20304435 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
8 0.1997209 15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl
9 0.19734776 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
10 0.19724287 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
11 0.19610333 176 nips-2003-Sequential Bayesian Kernel Regression
12 0.1776239 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
13 0.17464401 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
14 0.17048514 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
15 0.16654335 115 nips-2003-Linear Dependent Dimensionality Reduction
16 0.16641316 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
17 0.16556226 92 nips-2003-Information Bottleneck for Gaussian Variables
18 0.16443697 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
19 0.16046154 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
20 0.16031161 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
topicId topicWeight
[(0, 0.029), (11, 0.024), (29, 0.017), (30, 0.015), (33, 0.012), (35, 0.055), (53, 0.106), (69, 0.031), (71, 0.059), (76, 0.062), (85, 0.069), (91, 0.08), (93, 0.315), (99, 0.015)]
simIndex simValue paperId paperTitle
1 0.81370503 119 nips-2003-Local Phase Coherence and the Perception of Blur
Author: Zhou Wang, Eero P. Simoncelli
Abstract: unkown-abstract
same-paper 2 0.77197868 194 nips-2003-Warped Gaussian Processes
Author: Edward Snelson, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. 1
3 0.66579503 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
Author: Claudio Gentile
Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1
4 0.49350563 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
5 0.49117297 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
Author: Tong Zhang
Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1
6 0.49031547 113 nips-2003-Learning with Local and Global Consistency
7 0.49006581 78 nips-2003-Gaussian Processes in Reinforcement Learning
8 0.48938122 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
9 0.48850858 107 nips-2003-Learning Spectral Clustering
10 0.488352 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
11 0.48767373 172 nips-2003-Semi-Supervised Learning with Trees
12 0.48761898 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
13 0.48637885 189 nips-2003-Tree-structured Approximations by Expectation Propagation
14 0.48633304 112 nips-2003-Learning to Find Pre-Images
15 0.48628414 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
16 0.48413354 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
17 0.48389572 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
18 0.48372751 115 nips-2003-Linear Dependent Dimensionality Reduction
19 0.48360908 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
20 0.48300126 30 nips-2003-Approximability of Probability Distributions