nips nips2002 nips2002-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, where n is the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match the prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. [sent-8, score-0.511]
2 Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, where n is the number of training points), but also to perform training under strong restrictions on time and memory requirements. [sent-9, score-0.318]
3 In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. [sent-11, score-0.305]
4 1 Introduction Gaussian process (GP) models are powerful non-parametric tools for approximate Bayesian inference and learning. [sent-12, score-0.094]
5 However, their training time scaling of O(n³) and memory scaling of O(n²), where n is the number of training points, has hindered their more widespread use. [sent-14, score-0.308]
6 prediction error at a fraction of the training cost. [sent-18, score-0.156]
7 This is possible because many tasks can be solved satisfactorily using sparse representations of the data set. [sent-19, score-0.159]
8 the final predictor depends only on a fraction of training points crucial for good discrimination on the task. [sent-22, score-0.303]
9 Here, we call these utilized points the active set of the sparse predictor. [sent-23, score-0.427]
10 In case of SVM classification, the active set contains the support vectors, the points closest to the decision boundary. (Footnote 1: An SVM classifier is trained by minimizing a regularized loss functional, a process which cannot be interpreted as an approximation to Bayesian inference.) [sent-24, score-0.371]
11 If the active set size d is much smaller than n, an SVM classifier can be trained in average-case running time between O(n · d²) and O(n² · d), with memory requirements significantly less than n². [sent-26, score-0.298]
12 In an effort to overcome these scaling problems, a range of sparse GP approximations has been proposed [1, 8, 9, 10, 11]. [sent-28, score-0.227]
13 The algorithm proposed here accomplishes these objectives and, as our experiments show, can even be significantly faster in training than the SVM. [sent-33, score-0.093]
14 Our approach builds on earlier work of Lawrence and Herbrich [2] which we extend here by considering randomized greedy selections and focusing on an alternative representation of the GP model which facilitates generalizations to settings such as regression and multi-class classification. [sent-38, score-0.457]
15 Section 3 then contains the derivation of our fast greedy approximation and a description of the associated algorithm. [sent-40, score-0.276]
16 The density of the Gaussian distribution with mean µ and covariance matrix Σ is denoted by N (·|µ, Σ). [sent-48, score-0.096]
17 , (xn, yn)), xi ∈ X, yi ∈ {−1, +1}, drawn independently and identically distributed (i.i.d.) [sent-53, score-0.109]
18 From the Bayesian viewpoint, the relationship x → u is a random process u(·), which, in a Gaussian process (GP) model, is given a GP prior with mean function 0 and covariance kernel k(·, ·). [sent-59, score-0.267]
19 , u(x̃p))^T are jointly Gaussian with mean 0 ∈ R^p and covariance matrix (k(x̃i, x̃j))_{i,j} ∈ R^{p,p}. [sent-66, score-0.096]
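To make this finite-dimensional marginal concrete, here is a minimal numpy sketch (ours, not from the paper) that evaluates the covariance matrix for a handful of points and draws one sample of (u(x̃1), …, u(x̃p)); the squared-exponential kernel and the input points are placeholders.

```python
import numpy as np

def gp_prior_marginal_sample(k, X, rng):
    """Draw one sample of (u(x_1), ..., u(x_p)) from the GP prior:
    jointly Gaussian with mean 0 and covariance (k(x_i, x_j))_{i,j}."""
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    K += 1e-10 * np.eye(len(X))          # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(X)), K)

# example with a placeholder squared-exponential kernel on 5 arbitrary inputs
rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
u = gp_prior_marginal_sample(k, rng.standard_normal((5, 2)), rng)
```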
20 3 We focus on binary classification, but our framework can be applied straightforwardly to regression estimation and multi-class classification. [sent-68, score-0.083]
21 This linear function view, under which predictors become separating hyper-planes in F, is frequently used in the SVM community. [sent-74, score-0.094]
22 However, F is, in general, infinite-dimensional and not uniquely determined by the kernel function k. [sent-75, score-0.09]
23 We denote the sequence of latent outputs at the training points by u := (u(x1), . . . [sent-76, score-0.176]
24 , u(xn))^T ∈ R^n and the covariance or kernel matrix by K := (k(xi, xj))_{i,j} ∈ R^{n,n}. [sent-79, score-0.186]
25 The Bayesian posterior process for u(·) can be computed in principle using Bayes’ formula. [sent-80, score-0.098]
26 However, if the noise model P (y|u) is non-Gaussian (as is the case for binary classification), it cannot be handled tractably and is usually approximated by another Gaussian process, which should ideally preserve mean and covariance function of the former. [sent-81, score-0.101]
27 In general, computing Q is also infeasible, but several authors have proposed to approximate the global moment matching by iterative schemes which locally focus on one training pattern at a time [1, 4]. [sent-83, score-0.198]
28 These schemes (at least in their simplest forms) result in a parametric form for the approximating Gaussian, Q(u) ∝ P(u) ∏_{i=1}^{n} exp(−(pi/2)(ui − mi)²).  (1) [sent-84, score-0.288]
29 This may be compared with the form of the true posterior P(u|S) ∝ P(u) ∏_{i=1}^{n} P(yi|ui), which shows that Q(u) is obtained from P(u|S) by a likelihood approximation. [sent-85, score-0.083]
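As a sanity check of the parametric form (1), the sketch below (ours; dense O(n³) linear algebra, for intuition only) computes the mean h and covariance A of the Gaussian Q implied by site parameters p, m under the prior N(0, K); inactive sites simply keep p_i = 0.

```python
import numpy as np

def q_moments(K, p, m):
    """Moments of Q(u) = N(u | h, A) for the site approximation (1):
    the precision of Q is K^{-1} + diag(p) and its mean is A @ (p * m)."""
    A = np.linalg.inv(np.linalg.inv(K) + np.diag(p))  # posterior covariance
    h = A @ (p * m)                                   # posterior mean
    return h, A
```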
30 In order to update the parameters for a site i, we replace it in Q(u) by the corresponding true likelihood factor P (yi |ui ), resulting in a non-Gaussian distribution whose mean and covariance matrix can still be computed. [sent-88, score-0.24]
31 The site update is called the inclusion of i into the active set I. [sent-90, score-0.38]
32 The factorized form of the likelihood implies that the new and old Q differ only in the parameters pi , mi of site i. [sent-91, score-0.355]
33 This is a useful locality property of the scheme which is referred to as assumed density filtering (ADF) (e.g. [sent-92, score-0.107]
34 3 Sparse Gaussian Process Classification The simplest way to obtain a sparse Gaussian process classification (GPC) approximation from the ADF scheme is to leave most of the site parameters at 0, i.e. [sent-96, score-0.367]
35 For this to succeed, it is important to choose I so that the decision boundary between classes is represented essentially as accurately as if we used the whole training set. [sent-102, score-0.093]
36 Here, we follow a greedy approach suggested in [2], including new patterns one at a time into I. [sent-104, score-0.174]
37 The selection of a pattern to include is made by computing a score function for … (Footnote 4: A generalization of ADF, expectation propagation (EP) [4], allows for several iterations over the data.) [sent-105, score-0.24]
38 In the context of sparse approximations, it allows us to remove points from I or exchange them against points outside I, although we do not consider such moves here. [sent-106, score-0.208]
39 end for; i := argmax_{j∈J} ∆j; do updates for pi and mi according to (2). [sent-113, score-0.253]
40 The heuristic we implement has also been considered in the context of active learning (see chapter 5 of [3]): score an example (xi , yi ) by the decrease in entropy of Q(·) upon its inclusion. [sent-120, score-0.479]
41 As a result of the locality property of ADF and the fact that Q is Gaussian, it is easy to see that the entropy difference H[Qnew ] − H[Q] is proportional to the log ratio between the variances of the marginals Qnew (ui ) and Q(ui ). [sent-121, score-0.126]
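The proportionality can be made explicit with the entropy of a Gaussian; the following short derivation is ours (it uses the standard rank-one ADF covariance update and the quantity νi defined in (2) below), not a quotation from the paper.

```latex
% Our derivation, using the Gaussian entropy and the standard rank-one
% ADF covariance update A_new = A - nu_i a_{.,i} a_{.,i}^T:
\begin{align*}
H[\mathcal{N}(h,A)] &= \tfrac{1}{2}\log\det(2\pi e\,A),\\
\det A_{\mathrm{new}} &= (1-\nu_i a_{i,i})\,\det A \quad\text{(matrix determinant lemma)},\\
H[Q_{\mathrm{new}}]-H[Q] &= \tfrac{1}{2}\log(1-\nu_i a_{i,i})
  \;=\; \tfrac{1}{2}\log\frac{\operatorname{Var}_{Q_{\mathrm{new}}}[u_i]}{\operatorname{Var}_{Q}[u_i]}.
\end{align*}
```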
42 Thus, our heuristic (referred to as the differential entropy score) favors points whose inclusion leads to a large reduction in predictive (posterior) variance at the corresponding site. [sent-122, score-0.423]
43 Whilst other selection heuristics can be argued for and utilized, it turns out that the differential entropy score together with the simple likelihood approximation in (1) leads to an extremely efficient and competitive algorithm. [sent-123, score-0.426]
44 If I is the current active set, then all components of p and m not in I are zero, and some algebra using the Woodbury formula gives A = K − M^T M, M = L^{-1} Π_I^{1/2} K_{I,·} ∈ R^{d,n}, where L is the lower-triangular Cholesky factor of B = I + Π_I^{1/2} K_I Π_I^{1/2} ∈ R^{d,d}. [sent-127, score-0.176]
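This representation can be checked numerically; the sketch below (ours, dense linear algebra purely for verification) builds B, L and M for a toy active set and compares K − M^T M against the direct inverse (K^{-1} + Π)^{-1}.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
n, d = 8, 3
X = rng.standard_normal((n, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1)) + 1e-8 * np.eye(n)

I_set = [0, 3, 5]                          # toy active set
p = np.zeros(n)
p[I_set] = rng.uniform(0.5, 2.0, size=d)   # site precisions of the active sites
Pi_half = np.diag(np.sqrt(p[I_set]))       # Pi_I^{1/2}

B = np.eye(d) + Pi_half @ K[np.ix_(I_set, I_set)] @ Pi_half
L = cholesky(B, lower=True)
M = solve_triangular(L, Pi_half @ K[I_set, :], lower=True)   # L^{-1} Pi_I^{1/2} K_{I,.}

A_sparse = K - M.T @ M
A_direct = np.linalg.inv(np.linalg.inv(K) + np.diag(p))
assert np.allclose(A_sparse, A_direct)
```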
45 In order to compute the differential entropy score for a point j ∉ I, we have to know aj,j and hj. [sent-128, score-0.217]
46 Thus, when including i into the active set I, we need to update diag(A) and h accordingly, which in turn requires the matrices L and M to be kept up-to-date. [sent-129, score-0.218]
47 The update equations for pi, mi are pi = νi/(1 − ai,i νi) and mi = hi + αi/νi, where zi = yi(hi + b)/√(1 + ai,i), αi = yi N(zi|0, 1)/(Φ(zi) √(1 + ai,i)), and νi = αi(αi + (hi + b)/(1 + ai,i)).  (2) [sent-130, score-0.754]
48 We then update L → Lnew by appending the row (l^T, l) and M → Mnew by appending the row µ^T, where l = √pi M·,i, l = √(1 + pi Ki,i − l^T l), and µ = l^{-1}(√pi K·,i − M^T l).  (3) [sent-131, score-0.526]
49 The differential entropy score for j ∉ I can be computed based on the variables in (2) (with i → j) as ∆j = ½ log(1 − aj,j νj),  (4) which can be computed in O(1), given hj and aj,j. [sent-133, score-0.217]
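Putting (2)–(4) together, the following numpy/scipy sketch is our reconstruction of the inclusion step and of the greedy loop (not the authors' code). The variable names, the bookkeeping of diag(A) and h through the column a_{·,i} = K_{·,i} − M^T M_{·,i}, and scoring candidates by the size of their entropy decrease are our own choices.

```python
import numpy as np
from scipy.stats import norm

def include(i, K, y, b, L, M, diag_A, h, p, m):
    """Inclusion of point i: site update (2), growth of L and M as in (3),
    and refresh of the stored marginals diag(A) and h (our bookkeeping)."""
    a_ii = diag_A[i]
    s = np.sqrt(1.0 + a_ii)
    z = y[i] * (h[i] + b) / s
    alpha = y[i] * norm.pdf(z) / (norm.cdf(z) * s)
    nu = alpha * (alpha + (h[i] + b) / (1.0 + a_ii))
    p[i] = nu / (1.0 - a_ii * nu)                      # eq. (2)
    m[i] = h[i] + alpha / nu

    a_col = K[:, i] - M.T @ M[:, i]                    # column A_{.,i} of A = K - M^T M
    l = np.sqrt(p[i]) * M[:, i]                        # eq. (3)
    lstar = np.sqrt(1.0 + p[i] * K[i, i] - l @ l)
    mu = (np.sqrt(p[i]) * K[:, i] - M.T @ l) / lstar

    L = np.block([[L, np.zeros((L.shape[0], 1))],
                  [l[None, :], np.array([[lstar]])]])  # append row (l^T, lstar)
    M = np.vstack([M, mu[None, :]])                    # append row mu^T

    diag_A = diag_A - nu * a_col ** 2                  # A_new = A - nu_i a_col a_col^T
    h = h + alpha * a_col                              # ADF update of the posterior mean
    return L, M, diag_A, h, p, m

def greedy_ivm(K, y, d_max, b=0.0):
    """Forward selection using the differential entropy criterion; our sketch."""
    n = K.shape[0]
    diag_A, h = K.diagonal().copy(), np.zeros(n)
    p, m = np.zeros(n), np.zeros(n)
    L, M = np.zeros((0, 0)), np.zeros((0, n))
    active, candidates = [], np.ones(n, dtype=bool)
    for _ in range(d_max):
        s = np.sqrt(1.0 + diag_A)
        z = y * (h + b) / s
        alpha = y * norm.pdf(z) / (norm.cdf(z) * s)
        nu = alpha * (alpha + (h + b) / (1.0 + diag_A))
        decrease = -0.5 * np.log(1.0 - diag_A * nu)    # entropy decrease upon inclusion, cf. (4)
        decrease[~candidates] = -np.inf
        i = int(np.argmax(decrease))                   # most informative remaining point
        L, M, diag_A, h, p, m = include(i, K, y, b, L, M, diag_A, h, p, m)
        active.append(i)
        candidates[i] = False
    return active, h, diag_A, p, m
```

Note that the scoring pass only needs the stored marginals diag(A) and h, so the cost of an inclusion is dominated by the O(n · d) computation of µ and of the kernel column K·,i.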
50 Each inclusion costs O(n · d), dominated by the computation of µ, apart from the computation of the kernel matrix column K ·,i . [sent-135, score-0.259]
51 Given diag(A) and h, the error or the expected log likelihood of the current predictor on the remaining points J can be computed in O(n). [sent-138, score-0.189]
52 These scores can be used in order to decide how many points to include into the final I. [sent-139, score-0.115]
53 For kernel functions with constant diagonal, our selection heuristic is constant over patterns if I = ∅, so the first (or the first few) inclusion candidate is chosen at random. [sent-140, score-0.397]
54 The approximate predictive distribution over y∗ can be obtained by averaging the noise model over the Gaussian. [sent-143, score-0.134]
55 The optimal predictor for the approximation is sgn(µ(x∗) + b), which is independent of the variance σ²(x∗). [sent-144, score-0.119]
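For the probit noise model this averaging has a closed form; a small sketch (ours), assuming the approximate predictive marginal at x∗ is N(µ(x∗), σ²(x∗)):

```python
from scipy.stats import norm

def predictive_prob_plus(mu_star, var_star, b=0.0):
    """P(y* = +1 | x*) = ∫ Φ(u + b) N(u | mu*, var*) du = Φ((mu* + b) / sqrt(1 + var*))."""
    return norm.cdf((mu_star + b) / (1.0 + var_star) ** 0.5)

def predict_label(mu_star, b=0.0):
    """Optimal label under the approximation: sgn(mu(x*) + b), independent of var(x*)."""
    return 1 if mu_star + b > 0 else -1
```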
56 The simple scheme above employs full greedy selection over all remaining points to find the inclusion candidate. [sent-145, score-0.582]
57 This is sensible during early inclusions, but computationally wasteful during later ones, and an important extension of the basic scheme of [2] allows for randomized greedy selections. [sent-146, score-0.415]
58 To this end, we maintain a selection index J ⊂ {1, . . . , n}. [sent-147, score-0.153]
59 Having included i into I we modify the selection index J. [sent-151, score-0.153]
60 After a number of initial inclusions are done using full greedy selection, we use a J of fixed size m together with the following modification rule: for a fraction τ ∈ (0, 1), retain the τ · m best-scoring points in J, then fill it up to size m by drawing at random from {1, . . . , n}. [sent-158, score-0.471]
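A minimal sketch (ours) of this refresh rule; `scores` holds the current ∆ values of the candidates in J, and the assumption that the random refill skips points already in I or J is our own reading.

```python
import numpy as np

def refresh_selection_index(J, scores, active, n, m, tau, rng):
    """Keep the tau*m best-scoring candidates of J, then refill to size m at random."""
    keep = int(tau * m)
    ranked = [j for _, j in sorted(zip(scores, J), reverse=True)]
    J_new = ranked[:keep]
    pool = np.setdiff1d(np.arange(n), np.array(list(J_new) + list(active), dtype=int))
    need = m - len(J_new)
    if need > 0 and pool.size > 0:
        J_new += list(rng.choice(pool, size=min(need, pool.size), replace=False))
    return J_new
```

With the settings reported below this would be called with m = 500 and τ = 1/2 after the initial full greedy phase.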
61 We down-sampled the bitmaps to size 13 × 13 and split the MNIST training set into a (new) training set of size n = 59000 and a validation set of size 1000; the test set size is 10000. [sent-168, score-0.405]
62 A run consisted of model selection, training and testing, and all results are averaged over 10 runs. [sent-169, score-0.093]
63 We employed the RBF kernel k(x, x′) = C exp(−(γ/(2 · 169)) ‖x − x′‖²), x ∈ R^169, with hyper-parameters C > 0 (process variance) and γ > 0 (inverse squared length-scale). [sent-170, score-0.09]
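A direct transcription of this kernel in numpy (ours; the factor 169 = 13 · 13 matches the down-sampled image size above):

```python
import numpy as np

def rbf_kernel_13x13(X1, X2, C, gamma):
    """k(x, x') = C * exp(-(gamma / (2 * 169)) * ||x - x'||^2), x in R^169."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return C * np.exp(-(gamma / (2.0 * 169.0)) * sq)
```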
64 Model selection was done by minimizing validation set error, training on random training set subsets of size 5000. [sent-171, score-0.423]
65 The model selection training set for a run i is the same across tested methods. [sent-177, score-0.246]
66 The list of kernel parameters considered for selection has the same size across methods. [sent-178, score-0.288]
67 Table 1: Test error rates (gen, %) and training times (time, s) on binary MNIST tasks. [sent-179, score-0.092]
68 SVM: c 0–9; d 1247 798 2240 2610 1826 2306 1331 1759 2636 2731; gen 0.58 …; time 1281 864 2977 3687 2442 2771 1520 2251 3909 3469. [sent-189, score-0.092]
69 IVM: c 0–9; d 1130 820 2150 2500 1740 2200 1270 1660 2470 2740; gen 0.55 …; time 627 427 1690 2191 1210 1758 765 1110 2024 2444. [sent-199, score-0.131]
70 IVM: Sparse GPC, randomized greedy selections; d: final active set size. [sent-201, score-0.487]
71 For the SVM, we chose the SMO algorithm [6] together with a fast elaborate kernel matrix cache (see [7] for details). [sent-207, score-0.219]
72 For the IVM, we employed randomized greedy selections with fairly conservative settings. [sent-208, score-0.49]
73 We simply fixed b = Φ^{−1}(r), where r is the ratio between +1 and −1 patterns in the training set, and added a constant vb = 1/10 to the kernel k to account for the variance of the bias hyper-parameter. [sent-210, score-0.222]
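A small sketch (ours) of these two fixes; the literal reading of r as a count ratio is an assumption on our part.

```python
import numpy as np
from scipy.stats import norm

def fixed_bias(y):
    """b = Phi^{-1}(r); r is read literally as (#+1 patterns)/(#-1 patterns),
    which requires r < 1 (true for the one-vs-rest MNIST tasks above)."""
    r = np.sum(y == 1) / np.sum(y == -1)
    return norm.ppf(r)

def add_bias_variance(K, v_b=0.1):
    """Add the constant v_b = 1/10 to the kernel matrix to account for the bias variance."""
    return K + v_b
```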
74 To ensure a fair comparison, we did initial SVM runs and initialized the active set size d with the average number (over 10 runs) of SVs found, independently for each c. [sent-212, score-0.266]
75 Note that IVM shows comparable performance to the SVM, while achieving significantly lower training times. [sent-215, score-0.093]
76 For less conservative settings of the randomized selection parameters, further speed-ups might be realizable. [sent-216, score-0.323]
77 We also registered (not shown here) significant fluctuations in training time for the SVM runs, while this figure is stable and a priori predictable for the IVM. [sent-217, score-0.138]
78 Within the IVM, we can obtain estimates of predictive probabilities for test points, quantifying prediction uncertainties. [sent-218, score-0.097]
79 For the SVM, the size of the discriminant output is often used to quantify predictive uncertainty heuristically. [sent-220, score-0.142]
80 For the IVM, the … (Footnote 7: First 2 selections at random, then 198 using full greedy, after that a selection index of size 500 and a retained fraction τ = 1/2.) [sent-223, score-0.407]
81 For the SVM, we reject based on “distance” from the separating plane; for the IVM, based on estimates of predictive probabilities. [sent-232, score-0.172]
82 The IVM line runs below the SVM line exhibiting lower classification errors for identical rejection rates. [sent-233, score-0.099]
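Such a rejection curve can be computed from per-example confidences; a sketch (ours), where `confidence` would be the maximum predictive probability for the IVM and the absolute discriminant output for the SVM:

```python
import numpy as np

def rejection_curve(confidence, y_pred, y_true):
    """Error on the retained examples as a function of the rejection rate,
    rejecting the least confident test points first."""
    order = np.argsort(confidence)              # least confident first
    wrong = (np.asarray(y_pred) != np.asarray(y_true))[order]
    n = len(wrong)
    rates = np.arange(n) / n
    errors = np.array([wrong[r:].mean() for r in range(n)])
    return rates, errors
```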
83 9 However, the estimates of log P(y∗ = +1) do depend on predictive variances, i.e. [sent-236, score-0.097]
84 a measure of uncertainty about the predictive mean, which cannot be properly obtained within the SVM framework. [sent-238, score-0.097]
85 5 Discussion We have demonstrated that sparse Gaussian process classifiers can be constructed efficiently using greedy selection with a simple fast selection criterion. [sent-245, score-0.709]
86 Although we focused on the change in differential entropy in our experiments here, the simple likelihood approximation at the basis of our method allows for other equally efficient criteria such as information gain [3]. [sent-246, score-0.186]
87 ) while being much faster and more memory-efficient both in training and prediction. [sent-248, score-0.093]
88 This is due to the fact that SMO’s active set typically fluctuates heavily across the training set, thus a large fraction of the full kernel matrix must be evaluated. [sent-251, score-0.455]
89 9 It is straightforward to obtain the IVM for a joint GP classification model; however, the training costs rise by a factor of c². [sent-253, score-0.093]
90 10 We would expect SVMs to catch up with IVMs on tasks which require fairly large active sets, and for which very simple and fast covariance functions are appropriate (e.g. [sent-255, score-0.32]
91 Among the many proposed sparse GP approximations [1, 8, 9, 10, 11], our method is most closely related to [1]. [sent-258, score-0.188]
92 The latter is a sparse Bayesian online scheme which does not employ greedy selections and uses a more accurate likelihood approximation than we do, at the expense of slightly worse training time scaling, especially when compared with our randomized version. [sent-259, score-0.842]
93 It also requires the specification of a rejection threshold and is dependent on the ordering in which the training points are presented. [sent-260, score-0.23]
94 It incorporates steps to remove points from I, which can also be done straightforwardly in our scheme; however, such moves are likely to create numerical stability problems. [sent-261, score-0.128]
95 The differential entropy score has previously been suggested in the context of active learning (e.g. [sent-263, score-0.352]
96 In active learning, the label yi is not known at the time xi has to be scored, and expected rather than actual entropy changes have to be considered. [sent-266, score-0.374]
97 Furthermore, MacKay [3] applies the selection to multi-layer perceptron (MLP) models for which Gaussian posterior approximations over the weights can be very poor. [sent-267, score-0.257]
98 A sparse Bayesian compression scheme - the informative vector machine. [sent-277, score-0.25]
99 Fast training of support vector machines using sequential minimal optimization. [sent-292, score-0.093]
100 Using the Nystr¨m method to speed o up kernel machines. [sent-324, score-0.09]
wordName wordTfidf (topN-words)
[('ivm', 0.43), ('svm', 0.262), ('gp', 0.242), ('active', 0.176), ('greedy', 0.174), ('diag', 0.155), ('selection', 0.153), ('selections', 0.146), ('randomized', 0.137), ('erential', 0.136), ('classi', 0.134), ('mi', 0.127), ('pi', 0.126), ('sparse', 0.125), ('neil', 0.123), ('di', 0.115), ('adf', 0.107), ('inclusion', 0.102), ('predictive', 0.097), ('training', 0.093), ('gen', 0.092), ('qnew', 0.092), ('smo', 0.091), ('kernel', 0.09), ('entropy', 0.089), ('score', 0.087), ('points', 0.083), ('gi', 0.082), ('mnist', 0.082), ('gaussian', 0.08), ('gpc', 0.08), ('bayesian', 0.078), ('yi', 0.075), ('ui', 0.074), ('scheme', 0.07), ('cation', 0.066), ('lawrence', 0.066), ('predictor', 0.064), ('fraction', 0.063), ('approximations', 0.063), ('covariance', 0.063), ('inclusions', 0.061), ('matthias', 0.061), ('ralf', 0.061), ('site', 0.06), ('seeger', 0.059), ('process', 0.057), ('zi', 0.056), ('manfred', 0.056), ('approximation', 0.055), ('informative', 0.055), ('predictors', 0.054), ('rejection', 0.054), ('appending', 0.053), ('lehel', 0.053), ('heuristic', 0.052), ('cache', 0.049), ('er', 0.047), ('fast', 0.047), ('predictable', 0.045), ('csat', 0.045), ('straightforwardly', 0.045), ('runs', 0.045), ('size', 0.045), ('memory', 0.044), ('lt', 0.043), ('opper', 0.043), ('utilized', 0.043), ('likelihood', 0.042), ('update', 0.042), ('posterior', 0.041), ('uctuations', 0.041), ('hj', 0.041), ('separating', 0.04), ('erent', 0.04), ('scaling', 0.039), ('vb', 0.039), ('validation', 0.039), ('binary', 0.038), ('locality', 0.037), ('microsoft', 0.037), ('mackay', 0.037), ('approximate', 0.037), ('cantly', 0.037), ('herbrich', 0.036), ('reject', 0.035), ('schemes', 0.035), ('xi', 0.034), ('tasks', 0.034), ('restrictions', 0.034), ('sensible', 0.034), ('dominated', 0.034), ('moment', 0.033), ('requirements', 0.033), ('conservative', 0.033), ('edinburgh', 0.033), ('simpler', 0.033), ('matrix', 0.033), ('erence', 0.032), ('decide', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, where n is the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match the prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation.
2 0.21258283 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
3 0.1711008 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10³–10⁴) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database, is realistic.
4 0.15152149 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
5 0.13176131 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer
Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1
6 0.13158178 120 nips-2002-Kernel Design Using Boosting
7 0.12843227 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
8 0.12752673 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
9 0.12342421 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
10 0.12232077 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
11 0.11708593 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
12 0.10965604 45 nips-2002-Boosted Dyadic Kernel Discriminants
13 0.10704565 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
14 0.10576769 90 nips-2002-Feature Selection in Mixture-Based Clustering
15 0.10571525 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
16 0.10173774 41 nips-2002-Bayesian Monte Carlo
17 0.10079087 156 nips-2002-On the Complexity of Learning the Kernel Matrix
18 0.10026047 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
19 0.09385325 194 nips-2002-The Decision List Machine
20 0.093501277 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
topicId topicWeight
[(0, -0.311), (1, -0.124), (2, 0.072), (3, -0.034), (4, 0.098), (5, 0.003), (6, -0.057), (7, 0.028), (8, 0.034), (9, 0.049), (10, -0.026), (11, 0.008), (12, 0.131), (13, 0.062), (14, 0.128), (15, -0.139), (16, -0.06), (17, 0.04), (18, 0.141), (19, 0.136), (20, 0.06), (21, 0.032), (22, 0.07), (23, -0.086), (24, -0.005), (25, -0.102), (26, -0.038), (27, 0.009), (28, 0.021), (29, 0.039), (30, 0.118), (31, 0.122), (32, 0.016), (33, -0.007), (34, -0.016), (35, 0.061), (36, -0.051), (37, -0.141), (38, -0.05), (39, -0.063), (40, -0.02), (41, 0.006), (42, 0.04), (43, -0.001), (44, -0.085), (45, -0.056), (46, 0.02), (47, 0.016), (48, 0.021), (49, -0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.96391982 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, where n is the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match the prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation.
2 0.7797879 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10³–10⁴) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database, is realistic.
3 0.75668591 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
Author: Anton Schwaighofer, Volker Tresp, Peter Mayer, Alexander K. Scheel, Gerhard A. Müller
Abstract: We describe the RA scanner, a novel system for the examination of patients suffering from rheumatoid arthritis. The RA scanner is based on a novel laser-based imaging technique which is sensitive to the optical characteristics of finger joint tissue. Based on the laser images, finger joints are classified according to whether the inflammatory status has improved or worsened. To perform the classification task, various linear and kernel-based systems were implemented and their performances were compared. Special emphasis was put on measures to reliably perform parameter tuning and evaluation, since only a very small data set was available. Based on the results presented in this paper, it was concluded that the RA scanner permits a reliable classification of pathological finger joints, thus paving the way for a further development from prototype to product stage.
4 0.7312814 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
Author: Anton Schwaighofer, Volker Tresp
Abstract: Gaussian process regression allows a simple analytical treatment of exact Bayesian inference and has been found to provide good performance, yet scales badly with the number of training data. In this paper we compare several approaches towards scaling Gaussian processes regression to large data sets: the subset of representers method, the reduced rank approximation, online Gaussian processes, and the Bayesian committee machine. Furthermore we provide theoretical insight into some of our experimental results. We found that subset of representers methods can give good and particularly fast predictions for data sets with high and medium noise levels. On complex low noise data sets, the Bayesian committee machine achieves significantly better accuracy, yet at a higher computational cost.
5 0.72529572 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer
Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1
6 0.6751045 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
7 0.64998007 194 nips-2002-The Decision List Machine
8 0.63910961 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting
9 0.5801791 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
10 0.5782848 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
11 0.55267435 6 nips-2002-A Formulation for Minimax Probability Machine Regression
12 0.55063975 45 nips-2002-Boosted Dyadic Kernel Discriminants
13 0.53684902 138 nips-2002-Manifold Parzen Windows
14 0.53029525 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
15 0.52507126 55 nips-2002-Combining Features for BCI
16 0.5173251 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
17 0.49795625 41 nips-2002-Bayesian Monte Carlo
18 0.47403702 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
19 0.45571551 150 nips-2002-Multiple Cause Vector Quantization
20 0.45562181 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
topicId topicWeight
[(23, 0.018), (42, 0.055), (54, 0.101), (55, 0.032), (67, 0.011), (68, 0.019), (74, 0.065), (92, 0.036), (98, 0.591)]
simIndex simValue paperId paperTitle
1 0.99297309 129 nips-2002-Learning in Spiking Neural Assemblies
Author: David Barber
Abstract: We consider a statistical framework for learning in a class of networks of spiking neurons. Our aim is to show how optimal local learning rules can be readily derived once the neural dynamics and desired functionality of the neural assembly have been specified, in contrast to other models which assume (sub-optimal) learning rules. Within this framework we derive local rules for learning temporal sequences in a model of spiking neurons and demonstrate its superior performance to correlation (Hebbian) based approaches. We further show how to include mechanisms such as synaptic depression and outline how the framework is readily extensible to learning in networks of highly complex spiking neurons. A stochastic quantal vesicle release mechanism is considered and implications on the complexity of learning discussed. 1
2 0.9899497 103 nips-2002-How Linear are Auditory Cortical Responses?
Author: Maneesh Sahani, Jennifer F. Linden
Abstract: By comparison to some other sensory cortices, the functional properties of cells in the primary auditory cortex are not yet well understood. Recent attempts to obtain a generalized description of auditory cortical responses have often relied upon characterization of the spectrotemporal receptive field (STRF), which amounts to a model of the stimulusresponse function (SRF) that is linear in the spectrogram of the stimulus. How well can such a model account for neural responses at the very first stages of auditory cortical processing? To answer this question, we develop a novel methodology for evaluating the fraction of stimulus-related response power in a population that can be captured by a given type of SRF model. We use this technique to show that, in the thalamo-recipient layers of primary auditory cortex, STRF models account for no more than 40% of the stimulus-related power in neural responses.
same-paper 3 0.97879976 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, where n is the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match the prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation.
4 0.97573078 92 nips-2002-FloatBoost Learning for Classification
Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang
Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.
5 0.97232872 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
Author: Luis E. Ortiz, David A. McAllester
Abstract: This paper gives distribution-free concentration inequalities for the missing mass and the error rate of histogram rules. Negative association methods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding’s inequality, the Angluin-Valiant bound, Bernstein’s inequality, Bennett’s inequality, or McDiarmid’s theorem.
6 0.93094784 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
7 0.87935436 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
8 0.84420836 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
9 0.83868152 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
10 0.82814741 43 nips-2002-Binary Coding in Auditory Cortex
11 0.82510781 110 nips-2002-Incremental Gaussian Processes
12 0.8147946 12 nips-2002-A Neural Edge-Detection Model for Enhanced Auditory Sensitivity in Modulated Noise
13 0.80819887 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule
14 0.78841513 41 nips-2002-Bayesian Monte Carlo
15 0.78717339 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
16 0.7850005 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model
17 0.77857995 199 nips-2002-Timing and Partial Observability in the Dopamine System
18 0.77062327 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
19 0.76455259 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
20 0.75174761 148 nips-2002-Morton-Style Factorial Coding of Color in Primary Visual Cortex