nips nips2000 nips2000-75 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) memory and O(N · m^2) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. [sent-5, score-0.241]
2 Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. [sent-6, score-1.341]
3 However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) memory and [sent-7, score-0.233]
4 O(N · m^2) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. [sent-8, score-0.187]
5 In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. [sent-9, score-0.259]
6 The method is algorithmically simple and is easily extended to the multi-class case. [sent-10, score-0.054]
7 We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. [sent-11, score-0.262]
8 In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. [sent-12, score-0.109]
9 Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. [sent-13, score-0.463]
10 0.1% generalisation error for a given rejection rate of 10%. [sent-16, score-0.687]
11 1 Introduction Kernel machines have recently gained a lot of attention due to the popularisation of the support vector machine (SVM) [13] with a focus on classification and the revival of Gaussian Processes (GP) for regression [15]. [sent-17, score-0.203]
12 Subsequently, SVMs have been modified to handle regression [12] and GPs have been adapted to the problem of classification [8]. [sent-18, score-0.085]
13 Both schemes essentially work in the same function space that is characterised by kernels (SVM) and covariance functions (GP), respectively. [sent-19, score-0.077]
14 The SVM was inspired by results from statistical/PAC learning theory while GPs are usually considered in a Bayesian framework. [sent-21, score-0.039]
15 This ideological clash can be viewed as a continuation, within machine learning, of the by now classical disagreement between Bayesian and frequentist statistics. [sent-22, score-0.105]
16 Of course there exists a strong relationship between the two ideas, in particular with the Bayesian maximum a posteriori (MAP) estimator being the solution of an optimisation problem. [sent-24, score-0.106]
17 Interestingly, the two viewpoints have recently been reconciled theoretically in the so-called PAC-Bayesian framework [5] that combines the idea of a Bayesian prior with PAC-style performance guarantees and has been the basis of the so far tightest margin bound for SVMs [3]. [sent-25, score-0.218]
18 In practice, optimisation based algorithms have the advantage of a unique, deterministic solution and the availability of the cost function as an indicator for the quality of the solution. [sent-26, score-0.205]
19 In contrast, Bayesian algorithms based on sampling and voting are more flexible and have the so-called "anytime" property, providing a relatively good solution at any point in time. [sent-27, score-0.181]
20 Often, however, they suffer from the computational costs of sampling the Bayesian posterior. [sent-28, score-0.12]
21 In this contribution we review the idea of the Bayes point machine (BPM) as an approximation to Bayesian inference for linear classifiers in kernel space in Section 2. [sent-29, score-0.476]
22 In contrast to the GP viewpoint we do not define a Gaussian prior on the length ||w||_K of the weight vector. [sent-30, score-0.09]
23 Instead, we only consider weight vectors of length ||w||_K = 1 because it is only the spatial direction of the weight vector that matters for classification. [sent-31, score-0.039]
24 It is then natural to define a uniform prior on the resulting ball-shaped hypothesis space. [sent-32, score-0.083]
25 Hence, we determine the centre of mass ("Bayes point") of the resulting posterior, which is uniform in version space. [sent-33, score-0.285]
26 While the version space could be sampled using some form of Gibbs sampling (see, e.g., [sent-36, score-0.219]
27 [6] for an overview) or an ergodic dynamic system such as a billiard [4], we suggest using the perceptron algorithm trained on permutations of the training set for sampling in Section 3. [sent-38, score-0.612]
28 This extremely simple sampling scheme proves to be efficient enough to make the BPM applicable to large data sets. [sent-39, score-0.085]
29 X) and vector spaces by calligraphic capitalised letters. [sent-50, score-0.101]
30 The symbols P, E and I denote a probability measure, the expectation of a random variable and the indicator function, respectively. [sent-53, score-0.039]
31 2 Bayes Point Machines Let us consider the task of classifying patterns x ∈ X into one of the two classes y ∈ Y = {-1, +1} using functions h : X → Y from a given set H known as the hypothesis space. [sent-54, score-0.097]
32 If all that is needed for learning and classification are the inner products (·, ·)_K [sent-56, score-0.083]
33 in the feature space K, it is convenient to specify the feature map φ only by its inner product (kernel) function. (Footnote 1: For notational convenience we shall abbreviate φ(x) by x. [sent-58, score-0.091]
34 This should not be confused with the set x of training points.) [sent-59, score-0.076]
35 For simplicity, let us assume that there exists a classifier w* ∈ W that labels all our data (see equation (2)). [sent-63, score-0.062]
36 This assumption can easily be relaxed by introducing slack variables, as done in the soft margin variant of the SVM. [sent-67, score-0.085]
37 Then, given a training set z = (x, y) of m points x_i together with their classes y_i assigned by h_{w*}, drawn iid from an unknown data distribution P_Z = P_{Y|X} P_X, we can assume the existence of a version space V(z), i.e. [sent-68, score-0.269]
38 the set of all classifiers w ∈ W consistent with z (equation (3)). In a Bayesian spirit we incorporate all of our prior knowledge about w* into a prior distribution P_W over W. [sent-70, score-0.374]
39 In the absence of any a priori knowledge we suggest a uniform prior over the spatial direction of weight vectors w. [sent-71, score-0.125]
40 Now, given the training set z, we update our prior belief by Bayes' formula. [sent-72, score-0.119]
41 The Bayesian classification of a novel test point x is then given by Bayes_z(x) = argmax_{y ∈ Y} P_{W|Z^m=z}({h_W(x) = y}) = sign(E_{W|Z^m=z}[h_W(x)]) = sign(E_{W|Z^m=z}[sign((x, W)_K)]). Unfortunately, the strategy Bayes_z is in general not contained in the set H of classifiers considered beforehand. [sent-75, score-0.53]
42 Since P_{W|Z^m=z} is only non-zero inside version space, it has been suggested to use the centre of mass w_cm as an approximation for Bayes_z, i.e. [sent-76, score-0.314]
43 h_{w_cm}(x) = sign(E_{W|Z^m=z}[(x, W)_K]) = sign((x, w_cm)_K), where w_cm = E_{W|Z^m=z}[W] (equation (4)). [sent-78, score-0.24]
44 In a previous work [4] we calculated w_cm using a first order Markov chain based on a billiard-like algorithm (see also [10]). [sent-80, score-0.045]
45 We entered the version space V(z) using a perceptron algorithm and started playing billiards in version space V(z), thus creating a sequence of pseudo-random samples w_i due to the chaotic nature of the billiard dynamics. [sent-81, score-0.181]
46 Playing billiards in V(z) is possible because each training point (x_i, y_i) ∈ z defines a hyperplane {w ∈ W | y_i (x_i, w)_K = 0} ⊂ W. [sent-82, score-0.199]
47 Hence, the version space is a convex polyhedron on the surface of W. [sent-83, score-0.134]
48 After N bounces of the billiard ball the Bayes point was estimated by w_cm = (1/N) Σ_{i=1}^{N} w_i. [sent-84, score-0.262]
49 (Footnote: we call w ∈ W a classifier because there is a one-to-one correspondence between weight vectors and classifiers by virtue of (1).) [sent-86, score-0.062]
50 Although this algorithm shows excellent generalisation performance when compared to state-of-the-art learning algorithms like support vector machines (SVM) [13], its effort scales like O(m^2) and [sent-87, score-0.683]
51 O(N · m^2) in terms of memory and computational requirements, respectively. [sent-88, score-0.084]
52 3 Sampling the Version Space Clearly, all we need for estimating the Bayes point (4) is a set of classifiers w drawn uniformly from V(z). [sent-89, score-0.295]
53 In order to save computational resources it might be advantageous to achieve a uniform sample only approximately. [sent-90, score-0.116]
54 The classical perceptron learning algorithm offers the possibility to obtain up to m! [sent-91, score-0.325]
55 different classifiers in version space simply by learning on different permutations of the training set. [sent-92, score-0.54]
56 Given a permutation of {1, . . . , m}, the perceptron algorithm works as follows (see the sketch below). [sent-99, score-0.22]
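The enumerated update steps did not survive the extraction; the following is a minimal sketch of the classical perceptron pass in its primal form, run over one permutation of the training indices. The names X, y, pi and max_epochs are illustrative, not taken from the paper.

import numpy as np

def perceptron_pass(X, y, pi, max_epochs=1000):
    # X: (m, d) array of feature vectors, y: (m,) labels in {-1, +1},
    # pi: a permutation of range(m) giving the order in which points are visited.
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for i in pi:
            if y[i] * (X[i] @ w) <= 0:    # point i is misclassified (or has zero margin)
                w += y[i] * X[i]          # classical perceptron update
                mistakes += 1
        if mistakes == 0:                 # a full error-free sweep: w lies in version space V(z)
            return w
    return w

Different permutations pi generally terminate at different consistent weight vectors, which is what makes the procedure usable as a (rough) sampler of version space.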
57 A classical theorem due to Novikoff [7] guarantees the convergence of this procedure and furthermore provides an upper bound on the number t of mistakes needed until convergence. [sent-114, score-0.152]
58 More precisely, if there exists a classifier w_SVM with margin γ_z(w_SVM) > 0, [sent-115, score-0.147]
59 then the number of mistakes until convergence - which is an upper bound on the sparsity of the solution - is not more than R^2(x) γ_z^{-2}(w_SVM), where R(x) is the smallest real number such that for all x ∈ x, ||φ(x)||_K ≤ R(x). [sent-118, score-0.175]
60 The quantity γ_z(w_SVM) is maximised for the solution w_SVM found by the SVM, and whenever the SVM is theoretically justified by results from learning theory (see [11, 13]) the ratio d = R^2(x) γ_z^{-2}(w_SVM) is considerably less than m, say d << m. [sent-120, score-0.135]
61 Algorithmically, we can benefit from this sparsity by the following "trick": since w = Σ_{i=1}^{m} α_i x_i, all we need to store is the m-dimensional vector α. [sent-121, score-0.086]
62 Furthermore, we keep track of the m-dimensional vector o of real-valued outputs [sent-122, score-0.039]
63 o_i = y_i (x_i, w_t)_K = y_i Σ_{j=1}^{m} α_j k(x_i, x_j) of the current solution at the i-th training point. [sent-123, score-0.118]
64 Now, if o_i ≤ 0 we update α_i by α_i + y_i and update each o_j by o_j ← o_j + y_j y_i k(x_i, x_j), which requires only m kernel calculations. [sent-125, score-0.14]
65 In summary, the memory requirement of this algorithm is 2m and the number of kernel calculations is not more than d·m. [sent-126, score-0.292]
66 As a consequence, the computational requirement of this algorithm is no more than the computational requirement for the evaluation of the margin γ_z(w_SVM)! [sent-127, score-0.316]
67 We suggest using this efficient perceptron learning algorithm to obtain samples w_i for the computation of the Bayes point by (4); a sketch of the full procedure follows. [sent-128, score-0.429]
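Below is a sketch of the complete procedure in dual variables: the kernel perceptron with the α-vector and cached outputs o described above, run on N random permutations, with the resulting classifiers averaged as in (4). The Gram matrix K, the normalisation of each sample to ||w||_K = 1 (suggested by the unit-length restriction of Section 2), and all function names are assumptions for illustration rather than the authors' code.

import numpy as np

def kernel_perceptron(K, y, order, max_epochs=1000):
    # K: (m, m) Gram matrix, y: (m,) labels in {-1, +1}, order: permutation of range(m).
    # Memory is O(m): the expansion coefficients alpha and the cached outputs o_i = y_i (x_i, w)_K.
    m = len(y)
    alpha = np.zeros(m)
    o = np.zeros(m)
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            if o[i] <= 0:                    # training point i is misclassified
                alpha[i] += y[i]             # dual form of w <- w + y_i x_i
                o += y * y[i] * K[:, i]      # refresh all cached outputs: m kernel values
                mistakes += 1
        if mistakes == 0:                    # consistent with z, i.e. a point in V(z)
            break
    return alpha

def bayes_point(K, y, n_samples=100, seed=0):
    # Approximate the Bayes point (4) by averaging classifiers obtained from random permutations.
    rng = np.random.default_rng(seed)
    m = len(y)
    alphas = []
    for _ in range(n_samples):
        a = kernel_perceptron(K, y, rng.permutation(m))
        a = a / np.sqrt(a @ K @ a)           # keep only the spatial direction: ||w||_K = 1
        alphas.append(a)
    return np.mean(alphas, axis=0)           # dual coefficients of w_cm

def bpm_predict(alpha_cm, k_row):
    # k_row: kernel values k(x_j, x) of a test point x against all training points.
    return np.sign(alpha_cm @ k_row)         # sign((x, w_cm)_K)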
68 Figure 1: (a) Histogram of generalisation errors (estimated on a test set) using a kernel Gibbs sampler. [sent-129, score-0.624]
69 (b) Histogram of generalisation errors (estimated on a test set) using a kernel perceptron. [sent-130, score-0.624]
70 In order to investigate the usefulness of this approach experimentally, we compared the distribution of generalisation errors of samples obtained by perceptron learning on permuted training sets (as suggested earlier by [14]) with samples obtained by full Gibbs sampling [2]. [sent-133, score-1.059]
71 For computational reasons, we used only 188 training patterns and 453 test patterns of the classes "1" and "2" from the MNIST data set. [sent-134, score-0.301]
72 In Figure 1 (a) and (b) we plotted the distribution over 1000 random samples using the kernel k(x, x') = ((x, x') + 1)^5, given as equation (5). [sent-135, score-0.214]
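For concreteness, a possible way to compute the Gram matrix of the polynomial kernel (5) on raw input vectors; any scaling of the grey values is omitted here since it is not stated in the surviving text.

import numpy as np

def poly_gram(X, Z=None, degree=5):
    # k(x, x') = ((x, x') + 1)^degree, evaluated for all pairs of rows of X and Z.
    Z = X if Z is None else Z
    return (X @ Z.T + 1.0) ** degree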
73 Using a quantile-quantile (QQ) plot technique we can compare both distributions in one graph (see Figure 1 (c)). [sent-136, score-0.052]
74 These plots suggest that by simple permutation of the training set we are able to obtain a sample of classifiers exhibiting the same generalisation error distribution as with time-consuming Gibbs sampling. [sent-137, score-0.822]
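A sketch of the QQ comparison behind Figure 1 (c); err_perm and err_gibbs stand for the 1000 estimated generalisation errors per sampler and are placeholders, not data from the paper.

import numpy as np
import matplotlib.pyplot as plt

def qq_plot(err_perm, err_gibbs):
    # Plot matching empirical quantiles of the two error samples against each other.
    q = np.linspace(0.0, 1.0, 101)
    qp, qg = np.quantile(err_perm, q), np.quantile(err_gibbs, q)
    plt.plot(qg, qp, "k.")
    lim = [min(qg.min(), qp.min()), max(qg.max(), qp.max())]
    plt.plot(lim, lim, "r--")    # points close to the diagonal indicate matching distributions
    plt.xlabel("Gibbs sampler quantiles")
    plt.ylabel("permuted-perceptron quantiles")
    plt.show()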
75 4 Experimental Results In our large scale experiment we used the full MNIST data set with 60000 training examples and 10000 test examples of 28 x 28 grey value images of handwritten digits. [sent-138, score-0.224]
76 As input vector x we used the 784-dimensional vector of grey values. [sent-139, score-0.117]
77 The images were labelled by one of the ten classes "0" to "9". [sent-140, score-0.111]
78 For each of the ten classes y ∈ {0, . . . , 9} we ran the perceptron algorithm N = 10 times, each time labelling all training points of class y by +1 and the remaining training points by -1. [sent-144, score-0.372]
79 For the classification of a test image x we calculated the real-valued output f_i(x) of all 100 different classifiers, where we used the kernel k given by (5). [sent-146, score-0.239]
80 (α_i)_j refers to the expansion coefficient corresponding to the i-th classifier and the j-th data point. [sent-147, score-0.062]
81 (Footnote 4) We decided to use this kernel because it showed excellent generalisation performance when used with the support vector machine. [sent-152, score-0.695]
82 (Footnote 5) For notational simplicity we assume that the first N classifiers are classifiers for the class "0", the next N for class "1", and so on. [sent-153, score-0.532]
83 (Figure 2 table: rejection rates from 0% to 10% versus the corresponding generalisation error.) [sent-154, score-0.958]
84 Figure 2: Generalisation error as a function of the rejection rate for the MNIST data set. [sent-165, score-0.317]
85 Note that by rejection based on the real-valued output the generalisation error could be reduced to 0.1%, [sent-169, score-0.046]
86 indicating that this measure is related to the probability of misclassification of single test points. [sent-170, score-0.055]
87 For each of the ten classes y we calculated the real-valued decision of the Bayes point w_{bp,y} by f_{bp,y}(x) = (1/N) Σ_{i=1}^{N} f_{i+yN}(x). [sent-171, score-0.373]
88 Note that f_{bp,y}(x) [9] can be interpreted as an (unnormalised) approximation of the posterior probability that x is of class y when restricted to the function class (1). [sent-176, score-0.246]
89 In order to test the dependence of the generalisation error on the magnitude max_y f_{bp,y}(x) we fixed a certain rejection rate r ∈ [0, 1] and rejected the set of r · 10000 test points with the smallest value of max_y f_{bp,y}(x). [sent-177, score-1.213]
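A sketch of the ten-class decision and the rejection rule just described; f is assumed to be a (10*N, n_test) array whose row i + y*N holds f_{i+yN} evaluated on all test points (the ordering convention of footnote 5), and all names are illustrative.

import numpy as np

def classify_with_rejection(f, n_classes=10, n_samples=10, reject_rate=0.1):
    # f: (n_classes * n_samples, n_test) real-valued outputs of the individual classifiers.
    n_test = f.shape[1]
    # f_bp[y, t] = (1/N) * sum_i f_{i + y*N}(x_t): averaged Bayes-point decision for class y.
    f_bp = f.reshape(n_classes, n_samples, n_test).mean(axis=1)
    prediction = f_bp.argmax(axis=0)               # predicted digit for every test point
    confidence = f_bp.max(axis=0)                  # (unnormalised) posterior-like score max_y f_bp,y(x)
    n_reject = int(round(reject_rate * n_test))
    rejected = np.argsort(confidence)[:n_reject]   # the r * n_test least confident test points
    accepted = np.setdiff1d(np.arange(n_test), rejected)
    return prediction, accepted, rejected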
90 As can be seen from this plot, even without rejection the Bayes point has excellent generalisation performance. [sent-179, score-0.806]
91 Furthermore, rejection based on the real-valued output f_{bp}(x) turns out to be excellent, reducing the generalisation error to 0.1%. [sent-180, score-1.006]
92 One should also bear in mind that the learning time for this simple algorithm was comparable to that of SVMs. [sent-182, score-0.084]
93 Very advantageous features of our approach as compared to SVMs are its adjustable time and memory requirements and the "anytime" availability of a solution due to sampling. [sent-183, score-0.192]
94 If the training set grows further and we are not able to spend more time on learning, we can adjust the number N of samples used, at the price of a slightly worse generalisation error. [sent-184, score-0.52]
95 5 Conclusion In this paper we have presented an algorithm for approximating the Bayes point by rerunning the classical perceptron algorithm with a permuted training set. [sent-185, score-0.53]
96 We particularly exploited the sparseness of the solution, which must exist whenever the success of the SVM is theoretically justified. [sent-189, score-0.096]
97 The restriction to the zero training error case can be overcome by modifying the kernel as k_λ(x, x') = k(x, x') + λ I_{x=x'}, i.e. by adding a constant λ to the diagonal of the Gram matrix. [sent-190, score-0.262]
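If the modification is the usual addition of a constant to the kernel diagonal (the quadratic soft-margin construction), it amounts to one line on the Gram matrix before running the sampler; λ is a free regularisation parameter.

import numpy as np

def soft_margin_gram(K, lam=1.0):
    # k_lambda(x_i, x_j) = k(x_i, x_j) + lam * [i == j]: allows training errors at a quadratic cost.
    return K + lam * np.eye(K.shape[0])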
98 Another interesting question raised by our experimental findings is the following: By how much is the distribution of generalisation errors over random samples from version space related to the distribution of generalisation errors of the up to m! different classifiers obtained from permutations of the training set? [sent-194, score-0.466]
99 A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. [sent-211, score-0.121]
100 Prediction with Gaussian Processes: From linear regression to linear prediction and beyond. [sent-274, score-0.041]
wordName wordTfidf (topN-words)
[('generalisation', 0.37), ('rejection', 0.271), ('classifiers', 0.241), ('wsvm', 0.209), ('ibp', 0.208), ('bayes', 0.206), ('perceptron', 0.175), ('kernel', 0.14), ('billiard', 0.139), ('svm', 0.132), ('mnist', 0.118), ('bayesian', 0.112), ('excellent', 0.111), ('ilwllx', 0.104), ('wcrn', 0.104), ('yii', 0.104), ('pw', 0.1), ('version', 0.093), ('yi', 0.091), ('bpm', 0.09), ('sampling', 0.085), ('margin', 0.085), ('wt', 0.085), ('gibbs', 0.082), ('hw', 0.081), ('training', 0.076), ('graepel', 0.075), ('xii', 0.075), ('samples', 0.074), ('herbrich', 0.071), ('anytime', 0.069), ('billiards', 0.069), ('crn', 0.069), ('ewlzm', 0.069), ('permuted', 0.069), ('sign', 0.068), ('classical', 0.066), ('svms', 0.066), ('optimisation', 0.064), ('gp', 0.064), ('letters', 0.062), ('classifier', 0.062), ('jc', 0.06), ('availability', 0.06), ('bpms', 0.06), ('qq', 0.06), ('errors', 0.059), ('classes', 0.059), ('centre', 0.058), ('requirement', 0.058), ('mass', 0.056), ('test', 0.055), ('berlin', 0.054), ('handwritten', 0.054), ('theoretically', 0.054), ('algorithmically', 0.054), ('point', 0.054), ('ten', 0.052), ('plot', 0.052), ('mistakes', 0.05), ('permutations', 0.05), ('notational', 0.05), ('playing', 0.05), ('gps', 0.05), ('memory', 0.049), ('sparsity', 0.047), ('spirit', 0.047), ('viewpoint', 0.047), ('permutation', 0.047), ('error', 0.046), ('algorithm', 0.045), ('bold', 0.045), ('classification', 0.044), ('machines', 0.044), ('prior', 0.043), ('solution', 0.042), ('suggest', 0.042), ('space', 0.041), ('regression', 0.041), ('advantageous', 0.041), ('oj', 0.041), ('uniform', 0.04), ('xi', 0.04), ('pages', 0.039), ('grey', 0.039), ('indicator', 0.039), ('vector', 0.039), ('group', 0.039), ('learning', 0.039), ('posterior', 0.038), ('suggested', 0.038), ('patterns', 0.038), ('bound', 0.036), ('digits', 0.036), ('community', 0.036), ('schemes', 0.036), ('computational', 0.035), ('support', 0.035), ('williamson', 0.035), ('processes', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) memory and O(N · m^2) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.
2 0.42771342 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
3 0.41141436 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
4 0.35135087 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
5 0.14304066 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.
6 0.14134574 122 nips-2000-Sparse Representation for Gaussian Process Models
7 0.13562509 130 nips-2000-Text Classification using String Kernels
8 0.12596858 35 nips-2000-Computing with Finite and Infinite Networks
9 0.12249698 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
10 0.12119053 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
11 0.11698882 134 nips-2000-The Kernel Trick for Distances
12 0.11503868 111 nips-2000-Regularized Winnow Methods
13 0.11351399 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
14 0.11082613 138 nips-2000-The Use of Classifiers in Sequential Inference
15 0.10935637 121 nips-2000-Sparse Kernel Principal Component Analysis
16 0.10887054 54 nips-2000-Feature Selection for SVMs
17 0.10674567 74 nips-2000-Kernel Expansions with Unlabeled Examples
18 0.10509069 37 nips-2000-Convergence of Large Margin Separable Linear Classification
19 0.099167928 4 nips-2000-A Linear Programming Approach to Novelty Detection
20 0.094695941 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
topicId topicWeight
[(0, 0.355), (1, 0.381), (2, -0.161), (3, 0.034), (4, -0.104), (5, -0.117), (6, 0.038), (7, 0.048), (8, 0.094), (9, -0.135), (10, 0.319), (11, -0.081), (12, -0.072), (13, -0.025), (14, -0.204), (15, 0.047), (16, -0.14), (17, -0.047), (18, 0.101), (19, 0.103), (20, -0.106), (21, 0.013), (22, -0.058), (23, -0.015), (24, 0.032), (25, 0.054), (26, -0.094), (27, 0.163), (28, 0.063), (29, -0.032), (30, 0.005), (31, 0.024), (32, -0.001), (33, -0.074), (34, -0.031), (35, 0.027), (36, -0.018), (37, -0.002), (38, 0.079), (39, 0.012), (40, 0.012), (41, 0.01), (42, -0.011), (43, 0.01), (44, 0.033), (45, 0.041), (46, -0.042), (47, -0.006), (48, -0.001), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.95133436 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) memory and O(N · m^2) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.
2 0.86303496 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
3 0.8363446 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
4 0.79222524 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
5 0.45606017 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.
6 0.45599732 138 nips-2000-The Use of Classifiers in Sequential Inference
7 0.39589104 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
8 0.38346899 130 nips-2000-Text Classification using String Kernels
9 0.35246018 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
10 0.34717718 35 nips-2000-Computing with Finite and Infinite Networks
11 0.34124258 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
12 0.33947098 122 nips-2000-Sparse Representation for Gaussian Process Models
13 0.33154735 111 nips-2000-Regularized Winnow Methods
14 0.32145971 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
15 0.31445804 120 nips-2000-Sparse Greedy Gaussian Process Regression
16 0.30929458 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
17 0.30769336 74 nips-2000-Kernel Expansions with Unlabeled Examples
18 0.30662102 52 nips-2000-Fast Training of Support Vector Classifiers
19 0.30403173 54 nips-2000-Feature Selection for SVMs
20 0.30173749 37 nips-2000-Convergence of Large Margin Separable Linear Classification
topicId topicWeight
[(10, 0.054), (17, 0.119), (32, 0.032), (33, 0.095), (36, 0.016), (45, 0.305), (54, 0.017), (55, 0.018), (62, 0.04), (65, 0.018), (67, 0.044), (76, 0.052), (79, 0.014), (81, 0.013), (90, 0.058), (97, 0.028)]
simIndex simValue paperId paperTitle
1 0.83868307 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
Author: Trausti T. Kristjansson, Brendan J. Frey
Abstract: Condensation, a form of likelihood-weighted particle filtering, has been successfully used to infer the shapes of highly constrained
same-paper 2 0.82756859 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) memory and O(N · m^2) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.
3 0.6311655 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
4 0.59601945 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
5 0.58223379 125 nips-2000-Stability and Noise in Biochemical Switches
Author: William Bialek
Abstract: Many processes in biology, from the regulation of gene expression in bacteria to memory in the brain, involve switches constructed from networks of biochemical reactions. Crucial molecules are present in small numbers, raising questions about noise and stability. Analysis of noise in simple reaction schemes indicates that switches stable for years and switchable in milliseconds can be built from fewer than one hundred molecules. Prospects for direct tests of this prediction, as well as implications, are discussed. 1
6 0.51433241 58 nips-2000-From Margin to Sparsity
7 0.51077193 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
8 0.48876294 74 nips-2000-Kernel Expansions with Unlabeled Examples
9 0.48842919 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
10 0.48655462 37 nips-2000-Convergence of Large Margin Separable Linear Classification
11 0.48308551 111 nips-2000-Regularized Winnow Methods
12 0.48119769 21 nips-2000-Algorithmic Stability and Generalization Performance
13 0.47906303 4 nips-2000-A Linear Programming Approach to Novelty Detection
14 0.47282159 122 nips-2000-Sparse Representation for Gaussian Process Models
15 0.47088009 36 nips-2000-Constrained Independent Component Analysis
16 0.4680537 52 nips-2000-Fast Training of Support Vector Classifiers
17 0.46763751 60 nips-2000-Gaussianization
18 0.46678746 130 nips-2000-Text Classification using String Kernels
19 0.46658427 94 nips-2000-On Reversing Jensen's Inequality
20 0.46629226 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks