nips nips2005 nips2005-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sathiya Keerthi, Wei Chu
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. [sent-8, score-0.506]
2 The advantage of Gaussian process (GP) models over non-Bayesian kernel methods, such as support vector machines, comes from the explicit probabilistic formulation that yields predictive distributions for test instances and allows standard Bayesian techniques for model selection. [sent-11, score-0.274]
3 The cost of training GP models is O(n3 ) where n is the number of training instances, which results in a huge computational cost for large data sets. [sent-12, score-0.296]
4 Furthermore, when predicting a test case, a GP model requires O(n) cost for computing the mean and O(n2 ) cost for computing the variance. [sent-13, score-0.206]
5 Recently, sparse GP models which bring down the complexity of training as well as testing have attracted considerable attention. [sent-15, score-0.213]
6 Smola and Bartlett (2001) proposed a forward selection scheme to approximate the log posterior probability. [sent-18, score-0.251]
7 (2003) presented a very fast greedy selection method for building sparse GP regression models. [sent-21, score-0.336]
8 All of these methods make efforts to select an informative subset of the training instances for the predictive model. [sent-22, score-0.269]
9 This subset is usually referred to as the set of basis vectors, denoted as I. [sent-23, score-0.16]
10 The maximal size of I is usually limited by a value dmax . [sent-24, score-0.586]
11 As dmax n, the sparseness greatly alleviates the computational burden in both training and prediction of the GP models. [sent-25, score-0.649]
12 The performance of the resulting sparse GP models crucially depends on the criterion used in the basis vector selection. [sent-26, score-0.331]
13 Motivated by the ideas of Matching Pursuit (Vincent and Bengio, 2002), we propose a new criterion of greedy forward selection for sparse GP models. [sent-27, score-0.375]
14 , f (xn )]T and K is the n × n covariance matrix whose ij-th element is K(xi , xj ), K being the kernel function. [sent-42, score-0.155]
15 The computational cost of training is O(n3 ), which mainly comes from the need to invert the matrix (K + σ2 I) and obtain the vector α . [sent-52, score-0.148]
16 For doing predictions of a test instance the cost is O(n) to compute the mean and O(n2 ) for computing the variance. [sent-53, score-0.163]
17 (2003) gave a neat method for working with a reduced number of latent variables, laying the foundation for forming sparse GP models. [sent-57, score-0.208]
18 Instead of assuming n latent variables for all the training instances, sparse GP models assume only d latent variables placed at some chosen basis vectors {˜ i }, denoted as a column vector f I = [f (˜ 1 ), . [sent-59, score-0.552]
19 , P(f I ) = N (f I ; 0, KI ) (2) where KI is the d × d covariance matrix of the basis vectors whose ij-th element is K(˜ i , xj ). [sent-65, score-0.296]
20 x ˜ These latent variables are then projected to all the training instances. [sent-66, score-0.166]
21 Under the imposed joint Gaussian prior, the conditional mean at the training instances is KT K−1 f I , where I,· I KI,· is a d × n matrix of the covariance functions between the basis vectors and all the training instances. [sent-67, score-0.465]
22 The likelihood can be evaluated by these projected latent variables as follows P(y|f I ) = N (y; KT K−1 f I , σ 2 I) (3) I,· I The posterior is P(f I |y) = N (f I ; KI αI , σ 2 KI (σ 2 KI + KI,· KT )−1 KI ), where I,· αI = (σ 2 KI + KI,· KT )−1 KI,· y. [sent-68, score-0.144]
23 The predictive distribution at any test instance x is I,· ˜T ˜T ˜T ˜T ˜ ˜2 N (f (x); µx , σx ), where µx = k αI , σx = K(x, x) − k K−1 k + σ 2 k (σ 2 KI + ˜ ˜2 I ˜ ˜ KI,· KT )−1 k, and k is a column vector of the covariance functions between the basis I,· ˜ x vectors and the test instance x, i. [sent-69, score-0.45]
24 x While the cost of training the full GP model is O(n3 ), the training complexity of sparse GP models is only O(nd2 ). [sent-75, score-0.393]
25 Thus, if dmax is not big, learning on large datasets is feasible via I,· sparse GP models. [sent-77, score-0.73]
26 Also, for these sparse models, prediction for each test instance costs O(dmax ) for the mean and O(d2 ) for the variance. [sent-78, score-0.232]
27 max Generally the basis vectors can be placed anywhere in the input space Rm . [sent-79, score-0.243]
28 Since training instances usually cover the input space of interest quite well, it is quite reasonable to select basis vectors from just the set of training instances. [sent-80, score-0.477]
29 For a given problem dmax is chosen to be as large as possible subject to constraints on computational time in training and/or testing. [sent-81, score-0.649]
30 Then we use some basis selection method to find I of size dmax . [sent-82, score-0.889]
31 This viewpoint helps in the selection of basis vectors. [sent-86, score-0.294]
32 In other words, the basis vectors of the sparse GPs can be selected to minimize the negative log-posterior of the full GPs, π(α) defined as in (4). [sent-96, score-0.384]
33 3 Selection of basis functions The most crucial element of the sparse GP approach of the previous section is the choice of I, the set of basis vectors, which we take to be a subset of the training vectors. [sent-97, score-0.528]
34 The cheapest method is to select the basis vectors at random from the training data set. [sent-98, score-0.335]
35 But, such a choice will not work well when dmax is much smaller than n. [sent-99, score-0.586]
36 A principled approach is to select I that makes the corresponding sparse GP approximate well, the posterior distribution of the full GP. [sent-100, score-0.225]
37 It would be ideal to choose, among all subsets, I of size dmax , the one that gives the best value of π in (5). [sent-102, score-0.586]
38 (1) There is a basic cost associated with updating of the sparse GP solution, given a sequence of chosen basis functions. [sent-108, score-0.362]
39 This cost is the same for all forward selection methods, and is O(nd2 ). [sent-110, score-0.25]
40 (2) Then, depending on the basis selection method, max there is the cost associated with basis selection. [sent-111, score-0.554]
41 We will refer to the accumulated value of this cost for choosing all dmax basis functions as Tselection . [sent-112, score-0.851]
42 Forward basis selection methods differ in the way they choose effective basis functions while keeping Tselection small. [sent-113, score-0.434]
43 It is useful to note that the total cost associated with the random basis selection method mentioned earlier is just Tbasic = O(nd2 ). [sent-114, score-0.388]
44 Consider the typical situation in forward selection where we have a current working set I and we are interested in choosing the next basis vector, xi . [sent-117, score-0.402]
45 Thus, their selection criterion for the instance xi ∈ I is the decrease in π(α) that can be / obtained by allowing both αI and αi as variables to be non-zero. [sent-121, score-0.307]
46 Accumulated till dmax basis functions are added, this leads to a Tselection that has O(n2 d2 ) complexity, which is disproportionately higher than Tbasic . [sent-125, score-0.814]
47 max Therefore, Smola and Bartlett (2001) resorted to a randomized scheme by considering only κ basis elements randomly chosen from outside I during one basis selection. [sent-126, score-0.452]
48 max Although, from a complexity viewpoint, Tbasic and Tselection are same, it should be noted that the overall cost of the method is about 60 times that of Tbasic . [sent-129, score-0.182]
49 (2003) proposed a novel and very cheap heuristic criterion for basis selection. [sent-132, score-0.214]
50 The “informativeness” of an input vector xi ∈ / I is scored by the information gain between the true posterior distribution, P(f I |y) and a posterior approximation, Q(f I |y), where I denotes the new set of basis vectors after including a new element xi into the current set I. [sent-133, score-0.508]
51 Thus, the scores of all instances outside I can be efficiently evaluated in O(n) time, which makes this algorithm almost as fast as using random selection! [sent-136, score-0.162]
52 The potential weakness of this algorithm might be the non-use of the correlation in the remaining instances {xi : xi ∈ I}. [sent-137, score-0.185]
53 Kernel Matching Pursuit (Vincent and Bengio, 2002) is a sparse method for ordinary least squares that consists of two general greedy sparse approximation schemes, called prebackfitting and post-backfitting. [sent-142, score-0.302]
54 Both methods can be generalized to select the basis vectors for sparse GPs. [sent-145, score-0.36]
55 Our method is an efficient selection criterion that is based on the post-backfitting idea. [sent-147, score-0.197]
56 ˜ The minimizer, denoted as αI , is given by αI = (σ 2 KI + KI,· KT )−1 KI,· y I,· (6) / Our scoring criterion for an instance xi ∈ I is based on optimizing π(α) by fixing αI = αI and changing αi only. [sent-149, score-0.173]
57 The one-dimensional minimizer can be easily found as ˜ KT (y − KT αI ) − σ 2 ki αI i,· I,· T ∗ αi = (7) σ 2 K(xi , xi ) + KT Ki,· i,· where Ki,· is the n × 1 matrix of covariance functions between xi and all the training data, ˜ and ki is a d dimensional vector having K(xj , xi ), xj ∈ I. [sent-150, score-0.944]
58 The selection score of the instance xi is the decrease in π(α) achieved by the one dimensional optimization of αi , which can be written in closed form as 1 ∗ ∆i = (αi )2 σ 2 K(xi , xi ) + KT Ki,· (8) i,· 2 ∗ where αi is defined as in (7). [sent-151, score-0.337]
59 Summing the cost till dmax basis vectors are selected, we get an overall complexity of O(n2 dmax ), which is much higher than Tbasic . [sent-156, score-1.532]
60 Since it costs only O(n) time to evaluate our selection criterion in (8) for one instance, we can choose the next basis vector from a set of dmax instances randomly selected from outside of I. [sent-158, score-1.16]
61 But, from a practical point of view max the scheme is expensive because the selection criterion (8) requires computing a full kernel row Ki,· for each instance to be evaluated. [sent-160, score-0.389]
62 As kernel evaluations could be very expensive, we propose a modified scheme to keep the number of such evaluations small. [sent-161, score-0.19]
63 At the beginning of the algorithm (when I is empty) we initialize C by randomly choosing c training instances, computing the full kernel row, Ki,· for the chosen i’s and putting them in the rows of C. [sent-163, score-0.224]
64 Each step corresponding to a new basis vector selection proceeds as follows. [sent-164, score-0.274]
65 First we compute ∆i for the c instances corresponding to the rows of C and select the instance with the highest score for inclusion in I. [sent-165, score-0.268]
66 Finally, we randomly select κ fresh instances (from outside of I and the vectors that define C) to replace xj and the κ − 1 cached instances with the lowest score. [sent-168, score-0.417]
67 Thus, in each basis selection step, we compute the criterion scores for c instances, but evaluate full kernel rows only for κ fresh instances. [sent-169, score-0.542]
68 An important advantage of the above scheme is that, those basis elements which have very good scores, but are overtaken by another better element in a particular step, continue to remain in C and probably get to be selected in future basis selection steps. [sent-170, score-0.534]
69 The value of c can be set to be any integer between κ and dmax . [sent-172, score-0.586]
70 The above cache scheme is special to max our method and cannot be used with Smola and Bartlett’s method without unduly increasing its complexity. [sent-174, score-0.218]
71 If available, it is also useful to have an extra cache for storing kernel rows of instances which get discarded in one step, but which get to be considered again in a future step. [sent-175, score-0.29]
72 4 Model adaptation In this section we address the problem of model adaptation for a given number of basis functions, dmax . [sent-177, score-0.876]
73 Since the same ideas hold for all basis selection methods, we will not discuss them in detail. [sent-180, score-0.274]
74 The sparse GP model is conditional on the parameters in the kernel function and the Gaussian noise level σ 2 , which can all be collected together in θ, the hyperparameter vector. [sent-181, score-0.184]
75 This problem can be handled by repeating the following alternating steps: (1) fix θ and select I by the given basis selection algorithm; and (2) fix I and do a (short) gradient based adaptation of θ. [sent-184, score-0.374]
76 For the cache-based post-backfitting method of basis selection we also do the following for adding some stability to the model adaptation process. [sent-185, score-0.368]
77 5 Numerical experiments In this section, we compare our method against other sparse GP methods to verify the usefulness of our algorithm. [sent-187, score-0.146]
78 For all experiments, we use the ARD Gaussian kernel defined by m 2 K(xi , xj ) = υ0 exp + υb where υ0 , υ , υb > 0 and xi denotes the =1 υ (xi − xj ) -th element of xi . [sent-190, score-0.323]
79 We use the KIN40K data set,1 composed of 40,000 samples, to evaluate and compare the performance of the various basis selection criteria. [sent-193, score-0.296]
80 We first trained a full GPR model with the ARD Gaussian kernel on a subset of 2000 samples randomly selected in the dataset. [sent-194, score-0.174]
81 We compare the following five basis selection methods: 1. [sent-196, score-0.274]
82 our algorithm described in Section 3 with cache size c = κ = 59 (KAPPA) in which we evaluate the selection scores of κ instances at each step; 4. [sent-200, score-0.353]
83 our algorithm described in Section 3 with cache size c = dmax (DMAX); 5. [sent-201, score-0.666]
84 From the upper plot of Figure 1 we can see that INFO yields much worse NMSE results than KAPPA, DMAX and SB, when dmax is less than 600. [sent-207, score-0.609]
85 Interestingly, DMAX is even slightly better than SB when dmax is less than 200. [sent-210, score-0.586]
86 This is probably because DMAX has a bigger set of basis functions to choose from, than SB. [sent-211, score-0.16]
87 The lower plot of Figure 1 gives the CPU time consumed by the five algorithms for training, as a function of dmax , in log − log scale. [sent-215, score-0.586]
88 8 5 10 4 CPU TIME 10 3 10 2 10 1 10 SB DMAX KAPPA INFO 0 10 100 RAND 200 300 500 1000 2000 Figure 1: The variations of test set NMSE, test set NLPD and CPU time (in seconds) for training of the five algorithms as a function of dmax . [sent-231, score-0.721]
89 In the NMSE and NLPD plots, at each value of dmax , the results of the five algorithms are presented as a boxplot group. [sent-232, score-0.586]
90 2 As dmax increases, the cost of KAPPA asymptotically gets close to INFO. [sent-241, score-0.697]
91 The gap between DMAX and KAPPA is the O(nd2 max − κndmax ) cost in computing the score (8) for the additional (dmax − κ) instances. [sent-242, score-0.147]
92 Thus, as dmax increases, the curve of DMAX asymptotically becomes parallel to the curve of INFO. [sent-243, score-0.612]
93 Next, we compare model adaptation abilities of the following three algorithms for dmax = 500. [sent-248, score-0.651]
94 The values of these hyperparameters were obtained by training via a standard full GPR model on a manageable subset of 2000 samples randomly selected from the training data. [sent-251, score-0.258]
95 The model adaptation scheme is coupled with the INFO basis selection algorithm (ADAPT- INFO). [sent-254, score-0.384]
96 The model adaptation scheme is coupled with our DMAX basis selection algorithm (ADAPT- DMAX). [sent-256, score-0.384]
97 If we want to take kernel evaluations also into account, the cost of KAPPA is O(mκndmax ) where m is the number of input variables. [sent-257, score-0.191]
98 Note that INFO does not require any kernel evaluations for computing its selection criterion. [sent-258, score-0.22]
99 d denotes the number of input features, ntrg denotes the training data size and ntst denotes the test data size. [sent-261, score-0.167]
100 Fast forward selection to speed up sparse Gaussian process regression. [sent-385, score-0.282]
wordName wordTfidf (topN-words)
[('dmax', 0.586), ('ki', 0.295), ('gp', 0.217), ('info', 0.211), ('kt', 0.2), ('nlpd', 0.185), ('tselection', 0.168), ('sb', 0.167), ('kappa', 0.161), ('nmse', 0.161), ('basis', 0.16), ('seeger', 0.153), ('tbasic', 0.134), ('bartlett', 0.118), ('sparse', 0.117), ('selection', 0.114), ('smola', 0.11), ('instances', 0.108), ('cost', 0.085), ('cache', 0.08), ('xi', 0.077), ('rand', 0.073), ('kernel', 0.067), ('adaptation', 0.065), ('predictive', 0.063), ('training', 0.063), ('latent', 0.062), ('criterion', 0.054), ('forward', 0.051), ('ndmax', 0.05), ('vectors', 0.048), ('gps', 0.048), ('scheme', 0.045), ('gpr', 0.044), ('fixed', 0.043), ('instance', 0.042), ('posterior', 0.041), ('cpu', 0.039), ('evaluations', 0.039), ('greedy', 0.039), ('ard', 0.037), ('regression', 0.037), ('costs', 0.037), ('xj', 0.037), ('test', 0.036), ('gain', 0.036), ('gaussian', 0.036), ('vincent', 0.035), ('max', 0.035), ('select', 0.035), ('rows', 0.035), ('et', 0.034), ('adler', 0.034), ('disproportionately', 0.034), ('ntrg', 0.034), ('ntst', 0.034), ('till', 0.034), ('complexity', 0.033), ('full', 0.032), ('pursuit', 0.03), ('fresh', 0.029), ('scores', 0.029), ('seven', 0.029), ('method', 0.029), ('matching', 0.028), ('element', 0.028), ('datasets', 0.027), ('randomly', 0.027), ('selected', 0.027), ('al', 0.027), ('logarithm', 0.027), ('slower', 0.027), ('performances', 0.027), ('candela', 0.027), ('bank', 0.027), ('csat', 0.027), ('score', 0.027), ('asymptotically', 0.026), ('tting', 0.026), ('outside', 0.025), ('heavy', 0.025), ('nystr', 0.025), ('tresp', 0.025), ('hyperparameters', 0.025), ('ve', 0.024), ('williams', 0.024), ('promising', 0.024), ('house', 0.023), ('worse', 0.023), ('covariance', 0.023), ('evaluate', 0.022), ('pointed', 0.021), ('inclusion', 0.021), ('projected', 0.021), ('samples', 0.021), ('accumulated', 0.02), ('viewpoint', 0.02), ('dent', 0.02), ('variables', 0.02), ('leen', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
Author: Sathiya Keerthi, Wei Chu
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1
2 0.2108234 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
Author: Edward Snelson, Zoubin Ghahramani
Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1
3 0.152514 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
Author: Edward Meeds, Simon Osindero
Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts. 1
4 0.10647003 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng
Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.
5 0.096336454 195 nips-2005-Transfer learning for text classification
Author: Chuong B. Do, Andrew Y. Ng
Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1
6 0.087175392 30 nips-2005-Assessing Approximations for Gaussian Process Classification
7 0.074019276 80 nips-2005-Gaussian Process Dynamical Models
8 0.073394865 185 nips-2005-Subsequence Kernels for Relation Extraction
9 0.073312499 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
10 0.072699927 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
11 0.072008945 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
12 0.065225057 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
13 0.06509392 105 nips-2005-Large-Scale Multiclass Transduction
14 0.055956963 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
15 0.055931162 138 nips-2005-Non-Local Manifold Parzen Windows
16 0.054950614 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
17 0.054075085 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
18 0.053476866 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
19 0.053281672 114 nips-2005-Learning Rankings via Convex Hull Separation
20 0.05068057 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
topicId topicWeight
[(0, 0.188), (1, 0.067), (2, -0.034), (3, 0.031), (4, 0.095), (5, -0.135), (6, 0.101), (7, -0.064), (8, 0.082), (9, 0.168), (10, 0.039), (11, 0.007), (12, 0.128), (13, 0.151), (14, 0.002), (15, 0.101), (16, 0.056), (17, -0.091), (18, -0.042), (19, -0.08), (20, -0.144), (21, -0.043), (22, -0.077), (23, -0.053), (24, -0.061), (25, -0.09), (26, -0.089), (27, 0.123), (28, 0.062), (29, 0.053), (30, -0.015), (31, 0.062), (32, -0.056), (33, 0.011), (34, 0.02), (35, 0.007), (36, -0.085), (37, 0.157), (38, -0.103), (39, 0.023), (40, 0.061), (41, -0.061), (42, 0.052), (43, -0.01), (44, 0.018), (45, 0.021), (46, -0.019), (47, -0.004), (48, 0.06), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.93618125 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
Author: Sathiya Keerthi, Wei Chu
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1
2 0.86367029 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
Author: Edward Snelson, Zoubin Ghahramani
Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1
3 0.69844031 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
Author: Edward Meeds, Simon Osindero
Abstract: We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts. 1
4 0.49726242 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively. MCMC samples Laplace p(f|D) EP p(f|D) 0.2 0.15 0.45 0.1 0.4 0.05 0.3 −16 −14 −12 −10 −8 −6 f −4 −2 0 2 4 p(xi) 0 0.35 (a) 0.06 0.25 0.2 0.15 MCMC samples Laplace p(f|D) EP p(f|D) 0.1 0.05 0.04 0 0 2 0.02 xi 4 6 (c) 0 −40 −35 −30 −25 −20 −15 −10 −5 0 5 10 15 f (b) Figure 3: Panel (a) and (b) show two marginal distributions p(fi |D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples. For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well1 , given that the marginal likelihood comes as a 767 respectively 200 dimensional integral. The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihoods p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. Results are summarised in Table 1. Comparing Laplace’s method to EP the latter shows to be more accurate both in terms of error rate and information. While the error rates are relatively similar the predictive distribution obtained by EP shows to be more informative about the test targets. Note that for GPC the error rate only depends of the sign of the mean µ∗ of the approximated posterior over latent functions and not the entire posterior predictive distribution. As to be expected, the length of the mean vector m shows much larger values for the EP approximations. Comparing EP and SVMs the results are mixed. For the Crabs data set all methods show the same error rate but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty. 5 Summary & Conclusions Our experiments reveal serious differences between Laplace’s method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small amplitude, although the sign of the mean function will be mostly correct. As an effect LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been show empirically on several real world examples. Large resulting discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note, the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we’ve seen the the sign of the latent function tends to be correct (at least at the training locations). Laplace EP SVM Data Set m n E% I m E% I m E% I Ionosphere 351 34 8.84 0.591 49.96 7.99 0.661 124.94 5.69 0.681 Wisconsin 683 9 3.21 0.804 62.62 3.21 0.805 84.95 3.21 0.795 Pima Indians 768 8 22.77 0.252 29.05 22.63 0.253 47.49 23.01 0.232 Crabs 200 7 2.0 0.682 112.34 2.0 0.908 2552.97 2.0 0.047 Sonar 208 60 15.36 0.439 26.86 13.85 0.537 15678.55 11.14 0.567 USPS 3 vs 5 1540 256 2.27 0.849 163.05 2.21 0.902 22011.70 2.01 0.918 Table 1: Results for benchmark data sets. The first three columns give the name of the data set, number of observations m and dimension of inputs n. For Laplace’s method and EP the table reports the average error rate E%, the average information I (2) and the average length m of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label. The EP approximation has shown to give results very close to MCMC both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to show the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models, when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2]. Acknowledgements Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST2002-506778. This publication only reflects the authors’ views. References [1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996. [2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005. [3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998. [4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001. [5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998. [6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001. [7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. CUP, 2003. [8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.
5 0.48317105 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng
Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.
6 0.44240671 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
7 0.3984012 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers
8 0.38198537 185 nips-2005-Subsequence Kernels for Relation Extraction
9 0.35800773 71 nips-2005-Fast Krylov Methods for N-Body Learning
10 0.34552091 105 nips-2005-Large-Scale Multiclass Transduction
11 0.34233624 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
12 0.33598438 195 nips-2005-Transfer learning for text classification
13 0.33226705 171 nips-2005-Searching for Character Models
14 0.3214919 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
15 0.31971961 138 nips-2005-Non-Local Manifold Parzen Windows
16 0.31935948 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
17 0.31676257 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
18 0.31597576 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
19 0.31100187 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
20 0.30207303 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
topicId topicWeight
[(3, 0.053), (10, 0.03), (18, 0.033), (27, 0.037), (31, 0.014), (34, 0.103), (39, 0.011), (41, 0.011), (50, 0.012), (54, 0.262), (55, 0.045), (69, 0.094), (73, 0.031), (88, 0.121), (91, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.80574119 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
Author: Sathiya Keerthi, Wei Chu
Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1
2 0.75527006 102 nips-2005-Kernelized Infomax Clustering
Author: David Barber, Felix V. Agakov
Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1
3 0.60775453 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
Author: Edward Snelson, Zoubin Ghahramani
Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M 2 N ) training cost and O(M 2 ) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M , i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1
4 0.5918085 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore
Abstract: In this paper we consider the problem of finding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efficiently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.
5 0.59126049 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
6 0.58981723 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
7 0.58406925 114 nips-2005-Learning Rankings via Convex Hull Separation
8 0.577838 50 nips-2005-Convex Neural Networks
9 0.57770765 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
10 0.57745337 74 nips-2005-Faster Rates in Regression via Active Learning
11 0.57510304 151 nips-2005-Pattern Recognition from One Example by Chopping
12 0.57504666 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
13 0.57477927 45 nips-2005-Conditional Visual Tracking in Kernel Space
14 0.57371229 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
15 0.57369685 195 nips-2005-Transfer learning for text classification
16 0.57296008 160 nips-2005-Query by Committee Made Real
17 0.5709241 30 nips-2005-Assessing Approximations for Gaussian Process Classification
18 0.57034069 23 nips-2005-An Application of Markov Random Fields to Range Sensing
19 0.56991559 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
20 0.5680306 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification