nips nips2000 nips2000-133 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors combined with a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking label noise into account. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. [sent-5, score-0.379]
2 A uniform prior over normalised weight vectors combined with a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). [sent-6, score-1.575]
3 The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. [sent-7, score-0.311]
4 The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. [sent-8, score-0.685]
5 For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. [sent-9, score-0.473]
6 1 Introduction Two great ideas have dominated recent developments in machine learning: the application of kernel methods and the popularisation of Bayesian inference. [sent-10, score-0.352]
7 Focusing on the task of classification, various connections between the two areas exist: kernels have long been a part of Bayesian inference in the disguise of covariance functions that characterise priors over functions [9]. [sent-11, score-0.05]
8 Also, attempts have been made to re-derive the support vector machine (SVM) [1] - possibly the most prominent representative of kernel methods - as a maximum a-posteriori estimator (MAP) in a Bayesian framework [8] . [sent-12, score-0.309]
9 While this work suggests good strategies for evidence-based model selection, the MAP estimator is not truly Bayesian in spirit because it is not based on the concept of model averaging, which is crucial to Bayesian reasoning. [sent-13, score-0.174]
10 As a consequence, the MAP estimator is generally not as robust as a real Bayesian estimator. [sent-14, score-0.063]
11 While this drawback is inconsequential in a noise-free setting or in a situation dominated by feature noise, it may have severe consequences when the data is contaminated by label noise that may lead to a multi-modal posterior distribution. [sent-15, score-0.781]
12 In order to make use of the full Bayesian posterior distribution it is necessary to generate samples from this distribution. [sent-16, score-0.215]
13 This contribution is concerned with the generation of samples from the Bayesian posterior over the hypothesis space of linear classifiers in arbitrary kernel spaces in the case of label noise. [sent-17, score-1.06]
14 In contrast to [8] we consider normalised weight vectors, ‖w‖_K = 1. [sent-18, score-0.129]
15 This is because the classification given by a linear classifier only depends on the spatial direction of the weight vector w and not on its length. [sent-19, score-0.205]
16 This point of view leads to a hypothesis space isomorphic to the surface of an n-dimensional sphere which - in the absence of prior information - is naturally equipped with a uniform prior over directions. [sent-20, score-0.272]
17 Incorporating the label noise model into the likelihood then leads to a piecewise constant posterior on the surface of the sphere. [sent-21, score-0.898]
18 The kernel Gibbs sampler (KGS) is designed to sample from this type of posterior by iteratively choosing a random direction and sampling on the resulting piecewise constant one-dimensional density in the fashion of a hit-and-run algorithm [7]. [sent-22, score-0.847]
19 The resulting samples can be used in various ways: i) In Bayesian transduction [3] the decision about the labels of new test points can be inferred by a majority decision of the sampled classifiers. [sent-23, score-0.576]
20 ii) The posterior mean - the Bayes point machine (BPM) solution [4] - can be calculated as an approximation to transduction. [sent-24, score-0.205]
21 iii) The binary entropy of candidate training points can be calculated to determine their information content for active learning [2]. [sent-25, score-0.204]
22 iv) The model evidence [5] can be evaluated for the purpose of model selection. [sent-26, score-0.045]
23 We would like to point out, however, that the KGS is limited in practice to a sample size of m ~ 100 and should thus be thought of as an analytical tool to advance our understanding of the interaction of kernel methods and Bayesian reasoning. [sent-27, score-0.391]
24 The paper is structured as follows: in Section 2 we introduce the learning scenario and explain our Bayesian approach to linear classifiers in kernel spaces. [sent-28, score-0.304]
25 The kernel Gibbs sampler is explained in detail in Section 3. [sent-29, score-0.426]
26 Different applications of the KGS are discussed in Section 4, followed by an experimental demonstration of the BPM solution based on the KGS under label noise conditions. [sent-30, score-0.53]
27 Random variables are denoted in sans serif font (e.g. X), and vector spaces by calligraphic capitalised letters (e.g. K). [sent-37, score-0.11]
28 The symbols P, E and I denote a probability measure, the expectation of a random variable and the indicator function, respectively. [sent-40, score-0.093]
29 2 Bayesian Learning in Kernel spaces We consider learning given a sequence x = (x_1, ..., x_m) ∈ X^m of training objects together with a sequence y = (y_1, ..., [sent-41, score-0.068]
30 y_m) ∈ {−1, +1}^m of corresponding classes, drawn iid from a fixed distribution P_XY = P_Z over the space X × {−1, +1} = Z of input-output pairs. [sent-47, score-0.093]
31 The hypotheses are linear classifiers x ↦ ⟨w, φ(x)⟩_K =: ⟨w, x⟩_K in some fixed feature space K ⊆ ℓ₂, where we assume that a mapping φ : X → K is chosen a priori¹. [sent-48, score-0.13]
32 Since all we need for learning is the real-valued output ⟨w, x_i⟩_K of the classifier w at the m training points x_1, ..., x_m, we can restrict ourselves to weight vectors in the span of the mapped training points, w = ∑_{i=1}^m α_i x_i. (1) [sent-49, score-0.209]
33 This is particularly useful if the dimensionality dim(K) = n of the feature space K is much greater than the number m of training points, or possibly infinite. [sent-54, score-0.097]
34 From (1) we see that all that is needed is the inner product function k(x, x') = ⟨φ(x), φ(x')⟩_K, also known as the kernel (see [9] for a detailed introduction to the theory of kernels). [sent-55, score-0.252]
35 This, however, should not be confused with the m-tuple x denoting the training objects. [sent-57, score-0.062]
36 Figure 1 (panels a and b): Illustration of the (log) posterior distribution on the surface of a 3-dimensional sphere {w ∈ ℝ³ : ‖w‖_K = 1} resulting from a label noise model with a label flip rate of q = 0. [sent-58, score-1.178]
37 The log posterior is plotted over the longitude and latitude, and for small sample size it is multi-modal due to the label noise. [sent-60, score-0.633]
38 The classifier w* labelling the data (before label noise) was at (π/2, π). [sent-61, score-0.483]
39 In a Bayesian spirit we consider a prior P_W over possible weight vectors w ∈ W of unit length, i.e. ‖w‖_K = 1. [sent-62, score-0.169]
40 Given an iid training set z = (x, y) and a likelihood model P_{Y|X=x,W=w} we obtain the posterior P_{W|Z^m=z} using Bayes' formula, P_{W|Z^m=z}(w) ∝ P_{Y^m|X^m=x,W=w}(y) · P_W(w). [sent-65, score-0.281]
41 By the iid assumption and the independence of the denominator from w we obtain P_{W|Z^m=z}(w) ∝ P_W(w) ∏_{i=1}^m P_{Y|X=x_i,W=w}(y_i) =: P_W(w) · L[w, z]. [sent-67, score-0.098]
42 In the absence of specific prior knowledge, symmetry suggests taking P_W uniform on W. [sent-70, score-0.036]
43 Furthermore, we choose the likelihood model P_{Y|X=x,W=w}(y) = q if y⟨w, x⟩_K ≤ 0 and 1 − q otherwise, where q specifies the assumed level of label noise. [sent-71, score-0.429]
44 Note the difference from the commonly assumed model of feature noise, which essentially assumes noise in the (mapped) input vectors x instead of the labels y and constitutes the basis of the soft-margin SVM [1]. [sent-72, score-0.321]
45 Thus the likelihood L[w, z] of the weight vector w is given by L[w, z] = q^{m·R_emp[w,z]} (1 − q)^{m(1 − R_emp[w,z])}, (3) [sent-73, score-0.094]
46 where the training error R_emp[w, z] is defined as R_emp[w, z] = (1/m) ∑_{i=1}^m I_{y_i⟨w, x_i⟩_K ≤ 0}. [sent-74, score-0.062]
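To make the piecewise constant structure of (3) concrete, here is a minimal sketch, not taken from the paper, of how the training error and the label-noise likelihood of a classifier can be evaluated from its expansion coefficients and the Gram matrix alone; the function names and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Gram matrix with entries k(x, z) = exp(-gamma * ||x - z||^2); an illustrative kernel choice."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def training_error(alpha, K, y):
    """R_emp[w, z] for w = sum_i alpha_i x_i, evaluated purely through the Gram matrix K."""
    outputs = K @ alpha                     # real-valued outputs <w, x_i>_K at the training points
    return np.mean(y * outputs <= 0)

def label_noise_likelihood(alpha, K, y, q):
    """L[w, z] = q^(m R_emp) * (1 - q)^(m (1 - R_emp)) as in Eq. (3)."""
    m = len(y)
    r = training_error(alpha, K, y)
    return q ** (m * r) * (1 - q) ** (m * (1 - r))
```

Since R_emp can only take the m + 1 values 0, 1/m, ..., 1, the likelihood, and hence the posterior, takes at most m + 1 distinct values; this is exactly the piecewise constant structure the sampler exploits.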
47 Two data points y_1x_1 and y_2x_2 divide the space of normalised weight vectors W into four equivalence classes with different posterior density, indicated by the gray shading. [sent-75, score-0.472]
48 In each iteration, starting from w_{j−1}, a random direction v with ⟨v, w_{j−1}⟩_K = 0 is chosen. [sent-76, score-0.057]
49 We sample from the piecewise constant density on the great circle determined by the plane defined by w_{j−1} and v. [sent-79, score-0.287]
50 In order to obtain ζ*, we calculate the 2m angles ζ_i at which the training samples intersect the circle and keep track of the training error on each of the resulting intervals. [sent-80, score-0.257]
51 Figure 2: Schematic view of the kernel Gibbs sampling procedure. [sent-82, score-0.267]
52 Clearly, the posterior P_{W|Z^m=z} is piecewise constant for all w ∈ W with equal training error R_emp[w, z] (see Figure 1). [sent-83, score-0.254]
53 3 The Kernel Gibbs Sampler In order to sample from P_{W|Z^m=z} on W we suggest a Markov Chain sampling method. [sent-84, score-0.172]
54 For a given value of q, the sampling scheme can be decomposed into the following steps (see Figure 2): 1. [sent-85, score-0.058]
55 Choose an arbitrary starting point w_0 ∈ W and set j = 0. [sent-86, score-0.043]
56 Choose a direction v ∈ W in the tangent space {v ∈ W | ⟨v, w_j⟩_K = 0}. [sent-88, score-0.092]
57 Calculate all m hit points b_i ∈ W from w_j in direction v with the hyperplane having normal y_i x_i. Before normalisation, this is achieved by [4] b_i = w_j − (⟨w_j, x_i⟩_K / ⟨v, x_i⟩_K) v. [sent-90, score-0.114]
58 Calculate the training errors e_i of the 2m intervals [ζ_{Π(i−1)}, ζ_{Π(i)}] by evaluating e_i = R_emp[cos(ζ̄_i) w_j − sin(ζ̄_i) v, z] at the interval midpoints ζ̄_i = (ζ_{Π(i−1)} + ζ_{Π(i)})/2. [sent-113, score-0.148]
59 Sample an angle ζ* using the piecewise uniform distribution and (3). [sent-115, score-0.165]
60 Since the algorithm is carried out in feature space K we can use the expansions w = ∑_{i=1}^m α_i x_i, v = ∑_{i=1}^m ν_i x_i, and b = ∑_{i=1}^m γ_i x_i. [sent-121, score-0.035]
61 For the inner products and norms it follows that, e.g., ⟨w, x_j⟩_K = ∑_{i=1}^m α_i k(x_i, x_j) and ‖w‖²_K = ∑_{i,j=1}^m α_i α_j k(x_i, x_j). [sent-123, score-0.043]
62 As a consequence, the above algorithm can be implemented in arbitrary kernel spaces making use only of the kernel k. [sent-126, score-0.277]
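The following is a rough sketch, not the authors' code, of how the sampling steps above can be realised purely through kernel evaluations by working in coefficient space (w = Σ_i α_i x_i). For brevity the hit angles are computed directly from the kernel outputs rather than via the hit points b_i, the training error is re-evaluated at every interval midpoint instead of being tracked incrementally, and the starting point is drawn at random; all names are illustrative.

```python
import numpy as np

def kernel_gibbs_sampler(K, y, q, n_samples, rng=None):
    """Hit-and-run sampling sketch in coefficient space, using only the Gram matrix K.

    K: (m, m) Gram matrix, y: (m,) labels in {-1, +1}, q: assumed label flip rate."""
    rng = np.random.default_rng(rng)
    m = len(y)
    alpha = rng.standard_normal(m)
    alpha /= np.sqrt(alpha @ K @ alpha)                 # normalise so that ||w||_K = 1
    samples = []
    for _ in range(n_samples):
        # random direction in the tangent space: <v, w>_K = 0 and ||v||_K = 1
        v = rng.standard_normal(m)
        v -= (v @ K @ alpha) * alpha
        v /= np.sqrt(v @ K @ v)
        # the 2m angles at which <cos(z) w + sin(z) v, x_i>_K changes sign
        fw, fv = K @ alpha, K @ v
        hits = np.arctan2(-fw, fv)
        angles = np.sort(np.concatenate([hits, hits + np.pi]) % (2 * np.pi))
        angles = np.concatenate([angles, [angles[0] + 2 * np.pi]])   # close the circle
        # piecewise constant density: likelihood (3) at the midpoint times arc length
        mids = 0.5 * (angles[:-1] + angles[1:])
        outputs = np.cos(mids)[:, None] * fw + np.sin(mids)[:, None] * fv
        errors = (y * outputs <= 0).mean(axis=1)
        weights = q ** (m * errors) * (1 - q) ** (m * (1 - errors))
        probs = weights * np.diff(angles)
        probs /= probs.sum()
        # pick an arc, sample uniformly within it, and rotate w along the great circle
        i = rng.choice(len(probs), p=probs)
        zeta = rng.uniform(angles[i], angles[i + 1])
        alpha = np.cos(zeta) * alpha + np.sin(zeta) * v
        samples.append(alpha.copy())
    return np.array(samples)                            # (n_samples, m) coefficient vectors
```

Because the chosen arc is weighted by its length times the likelihood (3), regions of low training error are visited more often for small q, which is the behaviour illustrated in Figure 3.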
63 4 Applications of the Kernel Gibbs Sampler The kernel Gibbs sampler provides samples from the full posterior distribution over the hypothesis space of linear classifiers in kernel space for the case of label noise. [sent-127, score-1.453]
64 These samples can be used for various tasks related to learning. [sent-128, score-0.14]
65 In the following we will present a selection of these tasks. [sent-129, score-0.047]
66 Bayesian Transduction Given a sample from the posterior distribution over hypotheses, a good strategy for prediction is to let the sampled classifiers vote on each new test data point. [sent-130, score-0.352]
67 This mode of prediction is closest to the Bayesian spirit and has been shown for the zero-noise case to yield excellent generalisation performance [3]. [sent-131, score-0.286]
68 Also the fraction of votes for the majority decision is an excellent indicator of the reliability of the final estimate: rejection of those test points with the closest decisions results in a great reduction of the generalisation error on the remaining test points. [sent-132, score-0.641]
69 Given the posterior P_{W|Z^m=z}, the transductive decision is BT_z(x) = sign(E_{W|Z^m=z}[sign(⟨W, x⟩_K)]). (4) [sent-133, score-0.253]
70 In practice, this estimator is approximated by replacing the expectation by a sum over the sampled weight vectors w_j. [sent-134, score-0.288]
71 Bayes Point Machines For classification, Bayesian Transduction requires the whole collection of sampled weight vectors w_j in memory. [sent-135, score-0.185]
72 Since this may be impractical for large data sets we would like to derive a single classifier w from the Bayesian posterior. [sent-136, score-0.09]
73 An excellent approximation of the transductive decision BT_z(x) by a single classifier is obtained by exchanging the expectation with the inner sign-function in (4). [sent-137, score-0.362]
74 This yields the Bayes point w_bp = E_{W|Z^m=z}[W], (5) which again is estimated by replacing the expectation by the mean over the samples w_j. [sent-139, score-0.256]
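Given coefficient samples from the KGS, both the transductive vote (4) and the Bayes point prediction reduce to a few matrix products. This is an illustrative sketch; renormalising the averaged coefficients back onto the unit sphere is an assumption, and the returned vote margin implements the rejection heuristic mentioned above.

```python
import numpy as np

def bt_predict(samples, K_test):
    """Bayesian transduction (4): majority vote of the sampled classifiers.

    samples: (n, m) KGS coefficients; K_test: (m_test, m) values k(x_test, x_i)."""
    votes = np.sign(K_test @ samples.T)              # individual decisions, shape (m_test, n)
    mean_vote = votes.mean(axis=1)
    return np.sign(mean_vote), np.abs(mean_vote)     # decision and vote margin for rejection

def bpm_predict(samples, K_test, K_train):
    """Bayes point machine: classify with the (renormalised) posterior mean of the samples."""
    alpha_bp = samples.mean(axis=0)
    alpha_bp /= np.sqrt(alpha_bp @ K_train @ alpha_bp)   # assumed: project back to ||w||_K = 1
    return np.sign(K_test @ alpha_bp)
```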
75 Note that there exists no SVM equivalent w_SVM of the Bayes point w_bp in the case of label noise - a fact to be elaborated on in the experimental part in Section 5. [sent-140, score-0.616]
76 Figure 3: A set of 50 samples w_j of the posterior P_{W|Z^m=z} for various noise levels q. [sent-144, score-0.441]
77 Shown are the resulting decision boundaries in data space X. [sent-145, score-0.144]
78 Active Learning The Bayesian posterior can also be employed to determine the usefulness of candidate training points - a task that can be considered as a dual counterpart to Bayesian Transduction. [sent-146, score-0.291]
79 This is particularly useful when the label y of a training point x is more expensive to obtain than the training point x itself. [sent-147, score-0.603]
80 It was shown in the context of "Query by Committee" [2] that the binary entropy S(x, z) = p_+ log₂ p_+ + p_− log₂ p_− with p_± = P_{W|Z^m=z}(±⟨W, x⟩_K > 0) is an indicator of the information content of a data point x with regard to the learning task. [sent-148, score-0.134]
81 Samples w_j from the Bayesian posterior P_{W|Z^m=z} make it possible to estimate S for a given candidate training point x and the current training set z, and to decide on the basis of S whether it is worthwhile to query the corresponding label y. [sent-149, score-0.801]
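A sketch of the corresponding query score; the helper name is hypothetical, and the usual sign convention for binary entropy is used, so values near one bit flag the points on which the sampled classifiers disagree most.

```python
import numpy as np

def query_entropy(samples, k_x):
    """Binary entropy S(x, z) of a candidate query point x under the sampled posterior.

    samples: (n, m) KGS coefficients; k_x: (m,) kernel values k(x, x_i)."""
    p_plus = np.mean(samples @ k_x > 0)     # estimate of P(<W, x>_K > 0)
    p_minus = 1.0 - p_plus
    return -sum(p * np.log2(p) for p in (p_plus, p_minus) if p > 0)
```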
82 It turns out that in the zero-noise case the margin (the quantity maximised by the SVM) is a measure of the evidence of the model used [4]. [sent-151, score-0.082]
83 In the case of label noise the KGS serves to estimate this quantity. [sent-152, score-0.53]
84 5 Experiments In a first experiment we used a surrogate dataset of m = 76 data points x in X = ℝ² and the kernel k(x, x') = exp(−½‖x − x'‖²). [sent-153, score-0.266]
85 Using the KGS we sampled 50 different classifiers with weight vectors w_j for various noise levels q and plotted the resulting decision boundaries {x ∈ ℝ² | ⟨w_j, x⟩_K = 0} in Figure 3 (circles and crosses depict different classes). [sent-154, score-0.615]
86 As can be seen from these plots, increasing the noise level q leads to more diverse classifiers on the training set z. [sent-155, score-0.331]
87 In a second experiment we investigated the generalisation performance of the Bayes point machine (see (5)) in the case of label noise. [sent-156, score-0.596]
88 In ℝ³ we generated 100 random training and test sets of size m_train = 100 and m_test = 1000, respectively. [sent-157, score-0.062]
89 For each normalised point x ∈ ℝ³ the longitude and latitude were sampled from a Beta(5, 5) and Beta(0. [sent-158, score-0.32]
90 The classes y were obtained by randomly flipping the classes assigned by the classifier w* at (π/2, π) (see also Figure 1) with a true label flip rate of q* = 5%. [sent-161, score-0.609]
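A hedged sketch of this data-generating protocol. The second Beta's parameters and the exact angle scaling are truncated or garbled above, so Beta(5, 5) and the standard longitude/latitude ranges are used as placeholders, and the labelling direction w* is passed in explicitly.

```python
import numpy as np

def make_toy_data(m, w_star_angles, q_true=0.05, rng=None):
    """Points on the unit sphere in R^3, labelled by w* and corrupted by random label flips."""
    rng = np.random.default_rng(rng)
    lon = 2 * np.pi * rng.beta(5, 5, size=m)     # longitude; Beta parameters are placeholders
    lat = np.pi * rng.beta(5, 5, size=m)         # latitude; the second Beta is truncated above
    X = np.stack([np.cos(lon) * np.sin(lat),
                  np.sin(lon) * np.sin(lat),
                  np.cos(lat)], axis=1)
    lo, la = w_star_angles                        # longitude/latitude of the labelling classifier
    w_star = np.array([np.cos(lo) * np.sin(la), np.sin(lo) * np.sin(la), np.cos(la)])
    y = np.sign(X @ w_star)
    y[y == 0] = 1                                 # break ties on the decision boundary
    flips = rng.random(m) < q_true                # flip each label with probability q* = 5%
    y[flips] *= -1
    return X, y
```

The Gram matrix of the linear-plus-ridge kernel from the figure caption below can then be formed from X and fed to the sampler sketch above.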
91 In Figure 4 we plotted the estimated generalisation error for a BPM (trained using 100 samples w_j from the KGS) and [sent-162, score-0.213]
92 a quadratic soft-margin SVM at different label noise levels q and margin slack penalisations λ, respectively. [sent-167, score-0.666]
93 Figure 4: Comparison of BPMs and SVMs on data contaminated by label noise. [sent-166, score-0.474]
94 Generalisation errors of BPMs (circled error-bars) and soft-margin SVMs (triangled error-bars) vs. assumed noise level q and margin slack penalisation λ, respectively. [sent-163, score-0.234]
95 The dataset consisted of m = 100 observations with a label noise of 5% (dotted line) and we used the kernel k(x, x') = ⟨x, x'⟩ + λ · I_{x=x'}. [sent-164, score-0.53]
96 Clearly, the BPM with the correct noise model outperformed the SVM irrespective of the chosen level of regularisation. [sent-168, score-0.137]
97 6 Conclusion and Future Research The kernel Gibbs sampler provides an analytical tool for the exploration of various Bayesian aspects of learning in kernel spaces. [sent-173, score-0.806]
98 It provides a well-founded way of dealing with label noise but suffers from its computational complexity, which so far makes it inapplicable to large-scale applications. [sent-174, score-0.53]
99 Therefore it will be an interesting topic for future research to invent new sampling schemes that may be able to trade accuracy for speed and would thus be applicable to large data sets. [sent-175, score-0.058]
wordName wordTfidf (topN-words)
[('label', 0.393), ('kgs', 0.346), ('pwlz', 0.22), ('sampler', 0.217), ('kernel', 0.209), ('bayesian', 0.188), ('wj', 0.165), ('noise', 0.137), ('bpm', 0.136), ('gibbs', 0.134), ('piecewise', 0.129), ('wbp', 0.126), ('posterior', 0.125), ('generalisation', 0.123), ('transduction', 0.108), ('nci', 0.108), ('classifiers', 0.095), ('samples', 0.09), ('classifier', 0.09), ('svm', 0.084), ('berlin', 0.082), ('contaminated', 0.081), ('remp', 0.081), ('sampled', 0.08), ('bayes', 0.079), ('decision', 0.074), ('normalised', 0.071), ('graepel', 0.068), ('pw', 0.068), ('spaces', 0.068), ('spirit', 0.064), ('herbrich', 0.064), ('estimator', 0.063), ('beta', 0.063), ('ewlz', 0.063), ('latitude', 0.063), ('longitude', 0.063), ('nchl', 0.063), ('penalisation', 0.063), ('training', 0.062), ('great', 0.061), ('excellent', 0.061), ('calculate', 0.06), ('weight', 0.058), ('sampling', 0.058), ('iid', 0.058), ('points', 0.057), ('direction', 0.057), ('query', 0.055), ('transductive', 0.054), ('bpms', 0.054), ('wsvm', 0.054), ('flip', 0.054), ('indicator', 0.053), ('sample', 0.052), ('various', 0.05), ('yixi', 0.049), ('committee', 0.049), ('candidate', 0.047), ('selection', 0.047), ('vectors', 0.047), ('tool', 0.046), ('circle', 0.045), ('py', 0.045), ('dominated', 0.045), ('hypothesis', 0.045), ('evidence', 0.045), ('ei', 0.043), ('inner', 0.043), ('point', 0.043), ('madison', 0.043), ('wisconsin', 0.043), ('majority', 0.043), ('equivalence', 0.043), ('letters', 0.042), ('technical', 0.042), ('surface', 0.041), ('analytical', 0.041), ('denominator', 0.04), ('bt', 0.04), ('bold', 0.04), ('reproducing', 0.04), ('expectation', 0.04), ('levels', 0.039), ('lx', 0.038), ('germany', 0.038), ('closest', 0.038), ('content', 0.038), ('leads', 0.037), ('machine', 0.037), ('quantity', 0.037), ('uniform', 0.036), ('likelihood', 0.036), ('chain', 0.036), ('classes', 0.036), ('boundaries', 0.035), ('sphere', 0.035), ('space', 0.035), ('slack', 0.034), ('exploration', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
2 0.35135087 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires o (m 2 ) of memory and 0 (N . m2 ) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1
3 0.253598 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
4 0.1842722 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
5 0.14363964 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
6 0.1348311 121 nips-2000-Sparse Kernel Principal Component Analysis
7 0.12854181 134 nips-2000-The Kernel Trick for Distances
8 0.12640984 74 nips-2000-Kernel Expansions with Unlabeled Examples
9 0.11839787 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
10 0.11708416 111 nips-2000-Regularized Winnow Methods
11 0.1122117 122 nips-2000-Sparse Representation for Gaussian Process Models
12 0.11065458 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
13 0.11004305 35 nips-2000-Computing with Finite and Infinite Networks
14 0.10668965 54 nips-2000-Feature Selection for SVMs
15 0.10331471 4 nips-2000-A Linear Programming Approach to Novelty Detection
16 0.10314618 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
17 0.10286661 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
18 0.099140488 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
19 0.095521756 120 nips-2000-Sparse Greedy Gaussian Process Regression
20 0.094795763 37 nips-2000-Convergence of Large Margin Separable Linear Classification
topicId topicWeight
[(0, 0.313), (1, 0.287), (2, -0.077), (3, 0.056), (4, -0.074), (5, 0.025), (6, -0.004), (7, 0.054), (8, 0.057), (9, -0.117), (10, 0.288), (11, -0.037), (12, -0.039), (13, -0.045), (14, -0.145), (15, 0.025), (16, -0.125), (17, -0.051), (18, 0.068), (19, 0.041), (20, -0.17), (21, 0.082), (22, 0.015), (23, -0.069), (24, 0.007), (25, 0.049), (26, -0.016), (27, 0.122), (28, 0.051), (29, -0.013), (30, 0.052), (31, 0.008), (32, -0.028), (33, -0.011), (34, -0.014), (35, -0.023), (36, -0.02), (37, -0.029), (38, 0.046), (39, 0.122), (40, 0.017), (41, -0.035), (42, -0.014), (43, -0.048), (44, -0.007), (45, 0.034), (46, 0.046), (47, 0.077), (48, -0.028), (49, -0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.95752907 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
2 0.86894506 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires o (m 2 ) of memory and 0 (N . m2 ) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1
3 0.6938895 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
4 0.60904551 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
5 0.47527534 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
6 0.44568428 130 nips-2000-Text Classification using String Kernels
7 0.43640596 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
8 0.40044504 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
9 0.39040875 52 nips-2000-Fast Training of Support Vector Classifiers
10 0.38271594 121 nips-2000-Sparse Kernel Principal Component Analysis
11 0.37579605 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
12 0.37359756 134 nips-2000-The Kernel Trick for Distances
13 0.3681677 27 nips-2000-Automatic Choice of Dimensionality for PCA
14 0.36602113 35 nips-2000-Computing with Finite and Infinite Networks
15 0.36498642 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
16 0.35432243 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
17 0.35324866 122 nips-2000-Sparse Representation for Gaussian Process Models
18 0.35251626 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
19 0.34454775 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
20 0.34113356 4 nips-2000-A Linear Programming Approach to Novelty Detection
topicId topicWeight
[(10, 0.046), (17, 0.151), (32, 0.031), (33, 0.063), (36, 0.02), (45, 0.088), (55, 0.019), (62, 0.037), (65, 0.013), (67, 0.053), (75, 0.011), (76, 0.042), (79, 0.036), (81, 0.011), (90, 0.062), (97, 0.025), (98, 0.231)]
simIndex simValue paperId paperTitle
same-paper 1 0.8398152 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
2 0.77274418 79 nips-2000-Learning Segmentation by Random Walks
Author: Marina Meila, Jianbo Shi
Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
3 0.68539083 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires o (m 2 ) of memory and 0 (N . m2 ) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1
4 0.66678315 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
Author: Trausti T. Kristjansson, Brendan J. Frey
Abstract: Condensation, a form of likelihood-weighted particle filtering, has been successfully used to infer the shapes of highly constrained
5 0.62837225 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
6 0.60323364 74 nips-2000-Kernel Expansions with Unlabeled Examples
7 0.59861112 4 nips-2000-A Linear Programming Approach to Novelty Detection
8 0.5934397 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
9 0.58583444 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
10 0.58503026 130 nips-2000-Text Classification using String Kernels
11 0.58469266 111 nips-2000-Regularized Winnow Methods
12 0.58408201 37 nips-2000-Convergence of Large Margin Separable Linear Classification
13 0.58032882 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
14 0.57835865 21 nips-2000-Algorithmic Stability and Generalization Performance
15 0.57482946 122 nips-2000-Sparse Representation for Gaussian Process Models
16 0.57344121 60 nips-2000-Gaussianization
17 0.57287246 36 nips-2000-Constrained Independent Component Analysis
18 0.57073349 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
19 0.56928384 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
20 0.56777579 52 nips-2000-Fast Training of Support Vector Classifiers