nips nips2002 nips2002-201 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. [sent-6, score-0.156]
2 In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. [sent-7, score-0.54]
3 We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. [sent-9, score-0.391]
4 On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost. [sent-10, score-0.203]
5 The subset of representer method (SRM), the reduced rank approximation (RRA), online Gaussian processes (OGP) and the Bayesian committee machine (BCM) are approaches to solving the scaling problems based on a finite dimensional approximation to the typically infinite dimensional Gaussian process. [sent-14, score-0.483]
6 For all of the discussed methods, we also examine asymptotic and actual runtime and investigate the accuracy versus speed trade-off. [sent-16, score-0.19]
7 A major difference among the methods discussed here is that the BCM performs transductive learning, whereas the RRA, SRM and OGP methods perform induction-style learning. [sent-17, score-0.314]
8 By transduction 1 we mean that a particular method computes a test set dependent model, i. [sent-18, score-0.118]
9 As a consequence, the BCM approximation is calculated when the inputs to the test data are known. [sent-21, score-0.137]
10 In contrast, inductive methods (RRA, OGP, SRM) build a model solely on the basis of information from the training data. [sent-22, score-0.387]
11 2 presents the various inductive approaches to scaling GPR to large data, Sec. [sent-27, score-0.147]
12 1 Gaussian Process Regression. We consider Gaussian process regression (GPR) on a set of training data $D = \{(x_i, y_i)\}_{i=1}^N$, where targets are generated from an unknown function $f$ via $y_i = f(x_i) + e_i$, with independent Gaussian noise $e_i$ of variance $\sigma^2$. [sent-34, score-0.231]
13 We assume a Gaussian process prior on $f$, meaning that the functional values $f(x_i)$ on the points $\{x_i\}_{i=1}^N$ are jointly Gaussian distributed, with zero mean and covariance matrix (or Gram matrix) $K^N$. [sent-35, score-0.195]
14 $K^N$ itself is given by the kernel (or covariance) function $k(\cdot, \cdot)$, with $K^N_{ij} = k(x_i, x_j)$. [sent-36, score-0.103]
15 The mean and covariance of the GP prediction $f^*$ on a set of test points $x^*_1, \dots, x^*_T$ can be written conveniently as a weighted sum of kernel functions, $\hat{f}(x) = \sum_{i=1}^{N} w_i\, k(x, x_i)$ (1), with weight vector $w = (K^N + \sigma^2 \mathbf{1})^{-1} y$ (2), where $\mathbf{1}$ denotes a unit matrix and $y$ the vector of training targets. [sent-38, score-0.271]
16 (2) shows clearly what problem we may expect with large training data sets: the solution to a system of $N$ linear equations requires $O(N^3)$ operations, and the size of the Gram matrix $K^N$ may easily exceed the memory capacity of an average workstation. [sent-40, score-0.161]
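To make Eqs. (1) and (2) and their $O(N^3)$ cost concrete, here is a minimal NumPy sketch of exact GPR prediction with a squared exponential kernel; the kernel parameter `d`, the noise variance and the helper names are illustrative placeholders, not the paper's experimental settings.

```python
import numpy as np

def sq_exp_kernel(X1, X2, d=1.0):
    """Squared exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 d^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / d ** 2)

def gpr_predict_exact(X_train, y_train, X_test, noise_var=0.1, d=1.0):
    """Exact GPR mean: solve (K^N + sigma^2 I) w = y, then form the kernel expansion."""
    K_N = sq_exp_kernel(X_train, X_train, d)                              # N x N Gram matrix
    w = np.linalg.solve(K_N + noise_var * np.eye(len(X_train)), y_train)  # O(N^3) solve, Eq. (2)
    return sq_exp_kernel(X_test, X_train, d) @ w                          # Eq. (1) at the test points
```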
17 (2), by replacing the kernel matrix $K^N$ with some approximation $\tilde{K}^N$. [sent-43, score-0.175]
18 Williams and Seeger [12] use the Nyström method to calculate an approximation to the first $B$ eigenvalues and eigenvectors of $K^N$. [sent-44, score-0.118]
19 Essentially, the Nyström method performs an eigendecomposition of the $B \times B$ covariance matrix $K^B$, obtained from a set of $B$ basis points selected at random out of the training data. [sent-45, score-0.363]
20 1 Originally, the differences between transductive and inductive learning were pointed out in statistical learning theory [10]. [sent-47, score-0.359]
21 Inductive methods minimize the expected loss over all possible test sets, whereas transductive methods minimize the expected loss for one particular test set. [sent-48, score-0.412]
22 In a special case, this reduces to $\tilde{K}^N = K^{NB} (K^B)^{-1} (K^{NB})^\top$ (4), where $K^B$ is the kernel matrix for the set of basis points, and $K^{NB}$ is the matrix of kernel evaluations between training and basis points. [sent-50, score-0.549]
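As a hedged illustration of how the reduced rank approximation of Eq. (4) is used, the sketch below substitutes $\tilde{K}^N$ into Eq. (2) and applies the Woodbury identity so that only a $B \times B$ system has to be solved, giving $O(NB^2)$ cost; `kernel` can be the `sq_exp_kernel` defined above, and this is not the authors' implementation.

```python
import numpy as np

def rra_predict(X_train, y_train, X_test, X_basis, kernel, noise_var=0.1):
    """Reduced rank approximation: K^N is replaced by K_NB (K_B)^{-1} K_NB^T (Eq. 4)."""
    K_NB = kernel(X_train, X_basis)            # N x B
    K_B = kernel(X_basis, X_basis)             # B x B
    # Woodbury identity for w = (K_NB K_B^{-1} K_NB^T + sigma^2 I)^{-1} y:
    inner = noise_var * K_B + K_NB.T @ K_NB    # B x B system instead of N x N
    tmp = np.linalg.solve(inner, K_NB.T @ y_train)
    w = (y_train - K_NB @ tmp) / noise_var
    return kernel(X_test, X_train) @ w         # kernel expansion of Eq. (1)
```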
23 2 Subset of Representers Method (SRM) Subset of representers methods replace Eq. [sent-54, score-0.203]
24 (1) by a linear combination of kernel functions on a set of $B$ basis points, leading to an approximate predictor $\tilde{f}(x) = \sum_{i=1}^{B} \beta_i\, k(x, x_i)$ (5) with an optimal weight vector $\beta = \left( (K^{NB})^\top K^{NB} + \sigma^2 K^B \right)^{-1} (K^{NB})^\top y$ (6). Note that Eq. [sent-55, score-0.279]
25 (5) becomes exact if the kernel function allows a decomposition of the form $k(x_i, x_j) = K^{iB} (K^B)^{-1} (K^{jB})^\top$. [sent-56, score-0.103]
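The SRM predictor of Eqs. (5) and (6) takes only a few lines; the sketch below keeps the basis points as an explicit argument so that any of the selection schemes discussed next can be plugged in (again an illustration under the same placeholder assumptions, not the authors' code).

```python
import numpy as np

def srm_fit(X_train, y_train, X_basis, kernel, noise_var=0.1):
    """Optimal SRM weights: beta = (K_NB^T K_NB + sigma^2 K_B)^{-1} K_NB^T y  (Eq. 6)."""
    K_NB = kernel(X_train, X_basis)      # N x B
    K_B = kernel(X_basis, X_basis)       # B x B
    A = K_NB.T @ K_NB + noise_var * K_B  # building this B x B system costs O(N B^2)
    return np.linalg.solve(A, K_NB.T @ y_train)

def srm_predict(X_query, X_basis, beta, kernel):
    """Approximate predictor f~(x) = sum_i beta_i k(x, x_i)  (Eq. 5)."""
    return kernel(X_query, X_basis) @ beta
```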
26 In a practical implementation, one may expect different performance depending on the choice of the $B$ basis points $x_1, \dots, x_B$. [sent-57, score-0.221]
27 Different approaches for basis selection have been used in the literature; we will discuss them in turn. [sent-58, score-0.179]
28 Obviously, one may select the basis points at random (SRM Random) out of the training set. [sent-59, score-0.269]
29 In the sparse greedy matrix approximation (SRM SGMA, [6]) a subset of B basis kernel functions is selected such that all kernel functions on the training data can be well approximated by linear combinations of the selected basis kernels 2 . [sent-61, score-0.768]
30 If proximity in the associated reproducing kernel Hilbert space (RKHS) is chosen as the approximation criterion, the optimal linear combination (for a given basis set) can be computed analytically. [sent-62, score-0.281]
31 Smola and Schölkopf [6] introduce a greedy algorithm that finds a near-optimal set of basis functions, where the algorithm has the same asymptotic complexity $O(NB^2)$ as the SRM Random method. [sent-63, score-0.254]
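To illustrate the flavour of greedy basis selection, the sketch below repeatedly picks the training point whose kernel function currently has the largest RKHS residual with respect to the already selected basis, which amounts to pivoted incomplete Cholesky. This is a simplified stand-in for the SGMA criterion of [6] (which scores candidates by the reduction in total approximation error over all training kernels), so treat it as an assumption-laden sketch rather than a faithful reproduction.

```python
import numpy as np

def greedy_basis_selection(X_train, kernel, n_basis, jitter=1e-10):
    """Select basis indices whose kernel functions are worst approximated by
    linear combinations of the kernels selected so far (RKHS residual criterion)."""
    N = len(X_train)
    diag_K = np.array([kernel(X_train[i:i + 1], X_train[i:i + 1])[0, 0] for i in range(N)])
    residual = diag_K.copy()          # squared RKHS distance to span of current basis
    L = np.zeros((N, 0))              # partial (pivoted) Cholesky factor
    selected = []
    for _ in range(n_basis):
        j = int(np.argmax(residual))  # currently worst-approximated kernel function
        selected.append(j)
        k_j = kernel(X_train, X_train[j:j + 1])[:, 0]
        new_col = (k_j - L @ L[j]) / np.sqrt(residual[j] + jitter)
        L = np.column_stack([L, new_col])
        residual = np.maximum(diag_K - (L ** 2).sum(axis=1), 0.0)
    return selected
```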
32 Whereas the SGMA basis selection focuses only on the representation power of kernel functions, one can also design a basis selection scheme that takes into account the full likelihood model of the Gaussian process. [sent-64, score-0.453]
33 The underlying idea of the greedy posterior approximation algorithm (SRM PostApp, [7]) is to compare the log posterior of the subset of representers method and the full Gaussian process log posterior. [sent-65, score-0.534]
34 One thus can select basis functions in such a fashion that the SRM log posterior best approximates 3 the full GP log posterior, while keeping the total number of basis functions B minimal. [sent-66, score-0.474]
35 As for the case of SGMA, this algorithm can be formulated such that its asymptotic computational complexity is $O(NB^2)$, where $B$ is the total number of basis functions selected. [sent-67, score-0.234]
36 3 However, Rasmussen [5] noted that Smola and Bartlett [7] falsely assume that the additive constant terms in the log likelihood remain constant during basis selection. [sent-70, score-0.166]
37 The posterior process is assumed to be Gaussian and is modeled by a set of basis vectors. [sent-72, score-0.236]
38 Upon arrival of a new data point, the updated (possibly non-Gaussian) posterior process is projected onto the closest (in a KL-divergence sense) Gaussian posterior. [sent-73, score-0.116]
39 If this projection induces an error above a certain threshold, the newly arrived data point will be included in the set of basis vectors. [sent-74, score-0.164]
40 Similarly, basis vectors with minimum contribution to the posterior process may be removed from the basis set. [sent-75, score-0.378]
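One ingredient of this scheme can be sketched compactly: the "novelty" of a new input is its squared RKHS distance to the span of the current basis kernels, and the point is added to the basis only if this exceeds a threshold. The threshold value is an arbitrary placeholder and the projected posterior update and basis removal steps are omitted, so this is only an illustration of the inclusion test, not the full online GP algorithm.

```python
import numpy as np

def ogp_include_new_point(x_new, X_basis, kernel, threshold=1e-3):
    """Return True if k(., x_new) is poorly represented by the current basis kernels."""
    if len(X_basis) == 0:
        return True
    K_B = kernel(X_basis, X_basis)
    k_new = kernel(X_basis, x_new[None, :])[:, 0]
    k_nn = kernel(x_new[None, :], x_new[None, :])[0, 0]
    # Squared RKHS residual of projecting k(., x_new) onto span{k(., x_b)}.
    gamma = k_nn - k_new @ np.linalg.solve(K_B, k_new)
    return gamma > threshold
```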
41 3 Transductive Methods for Approximate GPR In order to derive a transductive kernel classifier, we rewrite the Bayes optimal prediction Eq. [sent-76, score-0.355]
42 (3) can be expressed as a weighted sum of kernel functions on the test points. [sent-80, score-0.156]
43 (7), the term $\mathrm{cov}(y \mid f^*)^{-1}$ gives a weighting of the training observations $y$: training points which cannot be predicted well from the functional values of the test points are given a lower weight. [sent-82, score-0.44]
44 Data points which are “closer” to the test points (in the sense that they can be predicted better) obtain a higher weight than data which are remote from the test points. [sent-83, score-0.234]
45 (7) still involves the inversion of the $N \times N$ matrix $\mathrm{cov}(y \mid f^*)$ and thus does not yet make a practical method. [sent-85, score-0.265]
46 By using different approximations for $\mathrm{cov}(y \mid f^*)^{-1}$, we obtain different transductive methods, which we shall discuss in the next sections. [sent-86, score-0.441]
47 Note that in a Bayesian framework, transductive and inductive methods are equivalent, if we consider matching models (the true model for the data is in the family of models we consider for learning). [sent-88, score-0.432]
48 In this case, transductive methods allow us to focus on the actual region of interest, i. [sent-90, score-0.289]
49 1 Transductive SRM. For large sets of test data, we may assume $\mathrm{cov}(y \mid f^*)$ to be a diagonal matrix, $\mathrm{cov}(y \mid f^*) = \sigma^2 \mathbf{1}$, meaning that the test values $f^*$ allow a perfect prediction of the training observations (up to noise). [sent-94, score-0.754]
50 (7) reduces to the prediction of a subset of representers method (see Sec. [sent-96, score-0.291]
51 2) where the test points are used as the set of basis points (SRM Trans). [sent-98, score-0.305]
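In code, SRM Trans is simply the SRM sketch above with the test inputs doubling as the basis set; this snippet reuses the hypothetical `srm_fit`, `srm_predict` and `sq_exp_kernel` helpers defined earlier.

```python
# SRM Trans: use the test inputs as the basis points of the SRM predictor.
beta = srm_fit(X_train, y_train, X_basis=X_test, kernel=sq_exp_kernel, noise_var=0.1)
f_test = srm_predict(X_test, X_basis=X_test, beta=beta, kernel=sq_exp_kernel)
```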
52 2 Bayesian Committee Machine (BCM). For a smaller number of test data, assuming a diagonal matrix for $\mathrm{cov}(y \mid f^*)$ (as for the transductive SRM method) seems unreasonable. [sent-100, score-0.528]
53 Instead, we can use the less stringent assumption of $\mathrm{cov}(y \mid f^*)$ being block diagonal. [sent-101, score-0.235]
54 After some matrix manipulations, we obtain the following approximation for Eq. [sent-102, score-0.102]
55 In the BCM, the training data $D$ are partitioned into $M$ disjoint sets $D_1, \dots, D_M$ of approximately the same size (“modules”), and $M$ GPR predictors are trained on these subsets. [sent-104, score-0.111]
56 In the prediction stage, the BCM calculates the unknown responses $f^*$ at a set of test points $x^*_1, \dots, x^*_T$ at once. [sent-105, score-0.176]
57 The prediction $E[f^* \mid D_i]$ of GPR module $i$ is weighted by the inverse covariance of its prediction. [sent-106, score-0.099]
58 An intuitively appealing effect of this weighting scheme is that modules which are uncertain about their predictions are automatically weighted less than modules that are certain about their predictions. [sent-107, score-0.172]
59 The block diagonal approximation of $\mathrm{cov}(y \mid f^*)$ becomes particularly accurate if each $D_i$ contains data that is spatially separated from the other training data. [sent-109, score-0.396]
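A hedged sketch of the BCM combination rule as it is commonly stated: each module predicts a mean and covariance over the whole test block, the means are weighted by the inverse predictive covariances, and the shared GP prior over $f^*$ is subtracted $M-1$ times so that it is not counted once per module. The module interface and the jitter terms are illustrative assumptions, not the authors' code; `modules` is assumed to be a list of `(X_i, y_i)` training subsets of roughly equal size.

```python
import numpy as np

def gpr_module_predict(X_i, y_i, X_test, kernel, noise_var=0.1):
    """Predictive mean and covariance of one GPR module trained on (X_i, y_i)."""
    K = kernel(X_i, X_i) + noise_var * np.eye(len(X_i))
    K_star = kernel(X_test, X_i)
    mean = K_star @ np.linalg.solve(K, y_i)
    cov = kernel(X_test, X_test) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

def bcm_predict(modules, X_test, kernel, noise_var=0.1, jitter=1e-8):
    """Combine module predictions, each weighted by its inverse predictive covariance."""
    T = len(X_test)
    M = len(modules)
    K_ss = kernel(X_test, X_test) + jitter * np.eye(T)
    precision = -(M - 1) * np.linalg.inv(K_ss)   # remove the (M - 1) extra copies of the prior
    weighted_means = np.zeros(T)
    for X_i, y_i in modules:
        mean_i, cov_i = gpr_module_predict(X_i, y_i, X_test, kernel, noise_var)
        P_i = np.linalg.inv(cov_i + jitter * np.eye(T))
        precision += P_i
        weighted_means += P_i @ mean_i
    return np.linalg.inv(precision) @ weighted_means   # approximate E[f* | D]
```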
60 4 Experimental Comparison In this section we will present an evaluation of the different approximation methods discussed in Sec. [sent-112, score-0.117]
61 For all data sets, we used a squared exponential kernel of the form $k(x_i, x_j) = \exp\left(-\frac{1}{2d^2} \|x_i - x_j\|^2\right)$, where the kernel parameter $d$ was optimized individually for each method. [sent-119, score-0.198]
62 To allow a fair comparison, the subset selection methods SRM SGMA and SRM PostApp were forced to select a given number B of basis functions (instead of using the stopping criteria proposed by the authors of the respective methods). [sent-120, score-0.355]
63 Thus, all methods form their predictions as a linear combination of exactly B basis functions. [sent-121, score-0.223]
64 Table 1 shows the average remaining variance 5 in a 10-fold cross validation procedure on all data sets. [sent-122, score-0.099]
65 On the ABALONE data set (very high level of noise), all of the tested methods achieved almost identical performance, both with B = 200 and B = 1000 basis functions. [sent-125, score-0.241]
66 Out of the inductive methods (SRM SGMA, SRM Random, SRM PostApp, RRA Nyström), best performance was always achieved with SRM PostApp. [sent-127, score-0.147]
67 (Footnote 4: from the DELVE archive, http://www. … edu/˜delve/. Footnote 5: remaining variance $= 100 \cdot \mathrm{MSE}_{\mathrm{model}} / \mathrm{MSE}_{\mathrm{mean}}$, where $\mathrm{MSE}_{\mathrm{mean}}$ is the MSE obtained from using the mean of training targets as the prediction for all test data.) [sent-130, score-0.189]
68 Marked in bold are results that are significantly better (with a significance level of 99% or above in a paired t-test) than any of the other methods. [sent-134, score-0.159]
69 Comparing induction and transduction methods, we see that the BCM performs significantly better than any inductive method in most cases. [sent-143, score-0.216]
70 Here, the average MSE obtained with the BCM was only a fraction (25-30%) of the average MSE of the best inductive method. [sent-144, score-0.147]
71 By a paired t-test we confirmed that the BCM is significantly better than all other methods on the KIN40K and ART data sets, with significance level of 99% or above. [sent-145, score-0.13]
72 This reduces the performance of the BCM, since the block diagonal approximation of Eq. [sent-148, score-0.122]
73 Note that all transductive methods necessarily lose their advantage over inductive methods when the allowed model complexity (that is, the number of basis functions) is increased. [sent-152, score-0.552]
74 We further noticed that, on the KIN40K and ART data sets, SRM Trans consistently outperformed SRM Random, despite SRM Trans being the most simplistic transductive method. [sent-153, score-0.234]
75 As mentioned above, we did not make use of the stopping criterion proposed for the SRM PostApp method, namely the relative gap between SRM log posterior and the log posterior of the full Gaussian process model. [sent-155, score-0.274]
76 In [7], the authors suggest that the gap is indicative of the generalization performance of the SRM model and use a gap of 2.5% in their experiments. [sent-156, score-0.116]
77 For example, selecting 200 basis points out of the KIN40K data set gave a gap of 1%, indicating a good fit. [sent-158, score-0.279]
78 As shown in Table 1, a significantly better error was achieved with 1000 basis functions (giving a gap of $3.5 \times 10^{-4}$). [sent-159, score-0.234]
79 Thus, it remains open how one can automatically choose an appropriate basis set size B. [sent-160, score-0.142]
80 Due to the numerically demanding approximations, the runtime of the OGP method for B = 1000 is rather long. [sent-161, score-0.13]
81 We thus only list results for B = 200 basis functions. [sent-162, score-0.142]
82 [Table 2: memory consumption and runtime (initialization and prediction) for each method, with asymptotic costs of $O(N)$, $O(NB)$ and $O(NB^2)$ and actual KIN40K times from a few minutes up to many hours.] [sent-163, score-0.129]
83 For the BCM, we assume here that training and test data are partitioned into modules of size B. [sent-165, score-0.178]
84 Asymptotic costs for prediction show the cost per test point. [sent-166, score-0.135]
85 The actual runtime is given for the KIN40K data set, with 36000 training examples, 4000 test patterns and B = 1000 basis functions for each method. [sent-167, score-0.426]
86 1 Computational Cost Table 2 shows the asymptotic computational cost for all approximation methods we have described in Sec. [sent-169, score-0.203]
87 The subset of representers methods (SRM) show the most favorable cost for the prediction stage, since the resulting model consists only of B basis functions with their associated weight vector. [sent-171, score-0.522]
88 Table 2 also lists the actual runtime 7 for one (out of 10) cross validation runs on the KIN40K data set. [sent-172, score-0.208]
89 Here, methods with the same asymptotic complexity exhibit runtimes ranging from 3 minutes to 150 hours. [sent-173, score-0.109]
90 For the SRM methods, most of this time is spent for basis selection (SRM PostApp and SRM SGMA). [sent-174, score-0.179]
91 We thus consider the slow basis selection as the bottleneck for SRM methods when working with larger numbers of basis functions or larger data sets. [sent-175, score-0.428]
92 Using matrix perturbation theory, we can show that the relative error of the approximate weight vector $\tilde{w}$ is bounded by $\frac{\|\tilde{w} - w\|}{\|\tilde{w}\|} \leq \max_i \frac{|\lambda_i - \tilde{\lambda}_i|}{\tilde{\lambda}_i + \sigma^2}$ (11), where $\lambda_i$ and $\tilde{\lambda}_i$ denote the eigenvalues of $K^N$ resp. [sent-182, score-0.094]
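The bound of Eq. (11), as reconstructed above, can be evaluated directly from the eigenvalues of the exact and approximated Gram matrices; the helper below is only meant to illustrate that computation.

```python
import numpy as np

def relative_weight_error_bound(K_N, K_N_approx, noise_var):
    """Evaluate max_i |lambda_i - lambda~_i| / (lambda~_i + sigma^2) from Eq. (11)."""
    lam = np.sort(np.linalg.eigvalsh(K_N))[::-1]
    lam_approx = np.sort(np.linalg.eigvalsh(K_N_approx))[::-1]
    return float(np.max(np.abs(lam - lam_approx) / (lam_approx + noise_var)))
```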
93 A closer look at the Nyström approximation [11] revealed that already for moderately complex data sets, such as KIN8NM, it tends to underestimate eigenvalues of the Gram matrix, unless a very high number of basis points is used. [sent-184, score-0.249]
94 While basis selection is painfully slow, the resulting models are compact, easy to use and accurate. [sent-191, score-0.142]
95 Online Gaussian processes achieve a slightly worse accuracy, yet they are the only (inductive) method that can easily be adapted for general likelihood models, such as classification and regression with nonGaussian noise. [sent-192, score-0.125]
96 On the other hand, if accurate predictions are the major concern, one may expect best results with the Bayesian committee machine. [sent-194, score-0.179]
97 On complex low noise data sets (such as KIN40K and ART) we observed significant advantages in terms of prediction accuracy, giving an average mean squared error that was only a fraction (25-30%) of the error achieved by the best inductive method. [sent-195, score-0.308]
98 For the BCM, one must take into account that it is a transduction scheme, thus prediction time and memory consumption are larger than those of SRM methods. [sent-196, score-0.194]
99 Although all discussed approaches scale linearly in the number of training data, they exhibit significantly different runtime in practice. [sent-197, score-0.153]
100 Using the Nyström method to speed up kernel machines. [sent-269, score-0.097]
wordName wordTfidf (topN-words)
[('srm', 0.592), ('bcm', 0.315), ('rra', 0.228), ('transductive', 0.212), ('cov', 0.205), ('gpr', 0.199), ('postapp', 0.152), ('representers', 0.152), ('sgma', 0.152), ('inductive', 0.147), ('basis', 0.142), ('nb', 0.139), ('nystr', 0.136), ('committee', 0.127), ('runtime', 0.106), ('nystrom', 0.095), ('ogp', 0.076), ('kernel', 0.073), ('prediction', 0.07), ('gaussian', 0.066), ('om', 0.066), ('approximation', 0.066), ('trans', 0.063), ('modules', 0.06), ('art', 0.06), ('online', 0.059), ('asymptotic', 0.058), ('gap', 0.058), ('points', 0.057), ('abalone', 0.056), ('greedy', 0.054), ('mse', 0.053), ('posterior', 0.051), ('methods', 0.051), ('kn', 0.05), ('test', 0.049), ('training', 0.047), ('gp', 0.047), ('consumption', 0.045), ('transduction', 0.045), ('subset', 0.045), ('process', 0.043), ('sets', 0.042), ('bayesian', 0.041), ('rasmussen', 0.04), ('regression', 0.039), ('schwaighofer', 0.038), ('selection', 0.037), ('tresp', 0.037), ('matrix', 0.036), ('rank', 0.035), ('processes', 0.035), ('memory', 0.034), ('functions', 0.034), ('signi', 0.034), ('anton', 0.033), ('paired', 0.031), ('gram', 0.03), ('cross', 0.03), ('block', 0.03), ('predictions', 0.03), ('table', 0.03), ('xi', 0.03), ('approximate', 0.03), ('covariance', 0.029), ('smola', 0.029), ('concern', 0.028), ('csat', 0.028), ('eigendecomposition', 0.028), ('nongaussian', 0.028), ('cost', 0.028), ('williams', 0.028), ('eigenvalues', 0.028), ('min', 0.028), ('cantly', 0.027), ('noise', 0.027), ('yet', 0.027), ('graz', 0.027), ('opper', 0.027), ('ki', 0.027), ('actual', 0.026), ('diagonal', 0.026), ('reduced', 0.026), ('level', 0.026), ('observations', 0.025), ('cance', 0.025), ('inversion', 0.024), ('seeger', 0.024), ('validation', 0.024), ('method', 0.024), ('log', 0.024), ('approximations', 0.024), ('stopping', 0.023), ('variance', 0.023), ('select', 0.023), ('data', 0.022), ('scheme', 0.022), ('expect', 0.022), ('medium', 0.022), ('verlag', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
2 0.12232077 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d2 ), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1
3 0.11267551 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
Author: Anton Schwaighofer, Volker Tresp, Peter Mayer, Alexander K. Scheel, Gerhard A. Müller
Abstract: We describe the RA scanner, a novel system for the examination of patients suffering from rheumatoid arthritis. The RA scanner is based on a novel laser-based imaging technique which is sensitive to the optical characteristics of finger joint tissue. Based on the laser images, finger joints are classified according to whether the inflammatory status has improved or worsened. To perform the classification task, various linear and kernel-based systems were implemented and their performances were compared. Special emphasis was put on measures to reliably perform parameter tuning and evaluation, since only a very small data set was available. Based on the results presented in this paper, it was concluded that the RA scanner permits a reliable classification of pathological finger joints, thus paving the way for a further development from prototype to product stage.
4 0.10988826 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10^3 – 10^4) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
5 0.10604654 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
6 0.092521206 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
7 0.075250544 120 nips-2002-Kernel Design Using Boosting
8 0.07338497 10 nips-2002-A Model for Learning Variance Components of Natural Images
9 0.07032571 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
10 0.068386532 156 nips-2002-On the Complexity of Learning the Kernel Matrix
11 0.064112023 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting
12 0.064104445 106 nips-2002-Hyperkernels
13 0.062067576 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
14 0.055900075 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
15 0.054569706 40 nips-2002-Bayesian Models of Inductive Generalization
16 0.054527111 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
17 0.054478429 124 nips-2002-Learning Graphical Models with Mercer Kernels
18 0.053684629 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
19 0.053584348 119 nips-2002-Kernel Dependency Estimation
20 0.051803961 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
topicId topicWeight
[(0, -0.174), (1, -0.065), (2, 0.022), (3, -0.018), (4, -0.009), (5, -0.024), (6, -0.026), (7, 0.05), (8, 0.044), (9, 0.044), (10, 0.036), (11, -0.042), (12, 0.123), (13, 0.036), (14, 0.085), (15, -0.046), (16, -0.016), (17, -0.041), (18, 0.084), (19, 0.071), (20, 0.04), (21, 0.008), (22, 0.038), (23, 0.003), (24, 0.041), (25, -0.106), (26, -0.028), (27, 0.086), (28, 0.039), (29, 0.045), (30, 0.142), (31, 0.121), (32, 0.033), (33, 0.139), (34, -0.03), (35, 0.048), (36, -0.084), (37, -0.142), (38, 0.019), (39, -0.114), (40, 0.187), (41, 0.048), (42, -0.157), (43, 0.063), (44, -0.079), (45, 0.137), (46, 0.029), (47, -0.038), (48, -0.039), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.92834508 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
2 0.72881901 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting
Author: Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero Candela, Roderick Murray-Smith
Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. -step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form , the prediction of at time is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.
3 0.68565804 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(103 − 104 ) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
4 0.61026013 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d2 ), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1
5 0.49894589 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
Author: E. Solak, R. Murray-smith, W. E. Leithead, D. J. Leith, Carl E. Rasmussen
Abstract: Gaussian processes provide an approach to nonparametric modelling which allows a straightforward combination of function and derivative observations in an empirical model. This is of particular importance in identification of nonlinear dynamic systems from experimental data. 1) It allows us to combine derivative information, and associated uncertainty with normal function observations into the learning and inference process. This derivative information can be in the form of priors specified by an expert or identified from perturbation data close to equilibrium. 2) It allows a seamless fusion of multiple local linear models in a consistent manner, inferring consistent models and ensuring that integrability constraints are met. 3) It improves dramatically the computational efficiency of Gaussian process models for dynamic system identification, by summarising large quantities of near-equilibrium data by a handful of linearisations, reducing the training set size – traditionally a problem for Gaussian process models.
6 0.46354082 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
7 0.4163101 41 nips-2002-Bayesian Monte Carlo
8 0.39938071 40 nips-2002-Bayesian Models of Inductive Generalization
9 0.38489616 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks
10 0.38426781 138 nips-2002-Manifold Parzen Windows
11 0.36780289 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.35391101 111 nips-2002-Independent Components Analysis through Product Density Estimation
13 0.34543824 6 nips-2002-A Formulation for Minimax Probability Machine Regression
14 0.32978684 106 nips-2002-Hyperkernels
15 0.32591879 115 nips-2002-Informed Projections
16 0.325748 119 nips-2002-Kernel Dependency Estimation
17 0.32297423 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
18 0.31893927 156 nips-2002-On the Complexity of Learning the Kernel Matrix
19 0.31832406 188 nips-2002-Stability-Based Model Selection
20 0.30741021 124 nips-2002-Learning Graphical Models with Mercer Kernels
topicId topicWeight
[(11, 0.014), (23, 0.024), (42, 0.079), (54, 0.106), (55, 0.055), (57, 0.011), (67, 0.01), (68, 0.022), (73, 0.336), (74, 0.083), (92, 0.036), (98, 0.134)]
simIndex simValue paperId paperTitle
1 0.80854559 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
Author: Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, Alexander Russell
Abstract: We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.
same-paper 2 0.76135665 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
3 0.71347165 74 nips-2002-Dynamic Structure Super-Resolution
Author: Amos J. Storkey
Abstract: The problem of super-resolution involves generating feasible higher resolution images, which are pleasing to the eye and realistic, from a given low resolution image. This might be attempted by using simple filters for smoothing out the high resolution blocks or through applications where substantial prior information is used to imply the textures and shapes which will occur in the images. In this paper we describe an approach which lies between the two extremes. It is a generic unsupervised method which is usable in all domains, but goes beyond simple smoothing methods in what it achieves. We use a dynamic tree-like architecture to model the high resolution data. Approximate conditioning on the low resolution image is achieved through a mean field approach. 1
4 0.53326821 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
Author: Nicholas Roy, Geoffrey J. Gordon
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, highdimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
5 0.52993888 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10^3 – 10^4) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.
6 0.52992618 10 nips-2002-A Model for Learning Variance Components of Natural Images
7 0.52694571 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
8 0.52556777 46 nips-2002-Boosting Density Estimation
9 0.5240584 41 nips-2002-Bayesian Monte Carlo
10 0.52326804 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
11 0.5228563 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.52265269 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
13 0.521824 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
14 0.52172893 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
15 0.52081048 3 nips-2002-A Convergent Form of Approximate Policy Iteration
16 0.52051169 2 nips-2002-A Bilinear Model for Sparse Coding
17 0.52011102 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
18 0.51886678 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
19 0.5187512 135 nips-2002-Learning with Multiple Labels
20 0.518188 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems