nips nips2000 nips2000-58 knowledge-graph by maker-knowledge-mining

58 nips-2000-From Margin to Sparsity


Source: pdf

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present an improvement of Novikoff's perceptron convergence theorem. [sent-7, score-0.446]

2 Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. [sent-8, score-2.023]

3 The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. [sent-9, score-0.74]

4 Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. [sent-10, score-0.413]

5 1 Introduction In the last few years there has been a large controversy about the significance of the attained margin, i.e. [sent-11, score-0.111]

6 the smallest real-valued output of a classifier before thresholding, as an indicator of generalisation performance. [sent-13, score-0.459]

7 Results in the VC, PAC and luckiness frameworks seem to indicate that a large margin is a prerequisite for small generalisation error bounds (see [14, 12]). [sent-14, score-0.801]

8 These results caused many researchers to focus on large margin methods such as the well-known support vector machine (SVM). [sent-15, score-0.493]

9 On the other hand, the notion of sparsity is deemed important for generalisation as can be seen from the popularity of Occam's razor-like arguments as well as compression considerations (see [8]). [sent-16, score-0.678]

10 In this paper we reconcile the two notions by reinterpreting an improved version of Novikoff's well known perceptron convergence theorem as a sparsity guarantee in dual space: the existence of large margin classifiers implies the existence of sparse consistent classifiers in dual space. [sent-17, score-1.932]

11 Even better, this solution is easily found by the perceptron algorithm. [sent-18, score-0.493]

12 The paper is structured as follows: after introducing the perceptron in dual variables in Section 2, we improve on Novikoff's perceptron convergence bound in Section 3. [sent-20, score-1.05]

13 2 (Dual) Kernel Perceptrons We consider learning given m objects X = {x_1, ... [sent-22, score-0.037]

14 ..., y_m} ∈ Y^m drawn iid from a fixed distribution P_XY = P_Z over the space X × {-1, +1} = Z of input-output pairs. [sent-28, score-0.076]

15 Our hypotheses are linear classifiers x ↦ sign(⟨w, φ(x)⟩) in some fixed feature space K ⊆ ℓ_2^n, where we assume that a mapping φ : X → K is chosen a priori. [sent-29, score-0.154]

16 Given the features φ_i : X → ℝ, the classical (primal) perceptron algorithm aims at finding a weight vector w ∈ K consistent with the training data. [sent-30, score-0.521]

17 Recently, Vapnik [14] and others - in their work on SVMs - have rediscovered that it may be advantageous to learn in the dual representation (see [1]), i.e. [sent-31, score-0.087]

18 expanding the weight vector in terms of the training data, w_α = Σ_{i=1}^m α_i φ(x_i) = Σ_{i=1}^m α_i x_i, (1) and learning the m expansion coefficients α ∈ ℝ^m rather than the components of w ∈ K. [sent-33, score-0.126]

19 This is particularly useful if the dimensionality n = dim(K) of the feature space K is much greater than the number m of training points (or possibly infinite). [sent-34, score-0.079]

20 This dual representation can be used for a rather wide class of learning algorithms (see [15]) - in particular if all we need for learning is the real-valued output ⟨w, x_i⟩_K of the classifier at the m training points x_1, ... [sent-35, score-0.378]

21 Thus it suffices to choose a symmetric function k : X × X → ℝ, called a kernel, and to ensure that there exists a mapping φ_k : X → K such that for all x, x' ∈ X: k(x, x') = ⟨φ_k(x), φ_k(x')⟩_K. (2) [sent-39, score-0.169]

22 ∫_X ∫_X k(x, x') f(x) f(x') dx dx' ≥ 0 for all f ∈ L_2(X), is called a Mercer kernel and has the following property: if the ψ_i ∈ L_2(X) solve the eigenvalue problem ∫_X k(x, x') ψ_i(x') dx' = λ_i ψ_i(x) with ∫_X ψ_i²(x) dx = 1 and, for all i ≠ [sent-44, score-0.386]

23 j, ∫_X ψ_i(x) ψ_j(x) dx = 0, then k can be expanded in a uniformly convergent series, i.e. [sent-45, score-0.123]

24 k(x, x') = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(x'). In order to see that a Mercer kernel fulfils equation (2), consider the mapping φ_k(x) = (√λ_1 ψ_1(x), √λ_2 ψ_2(x), ... [sent-48, score-0.138]

25 ) (3) whose existence is ensured by the third property. [sent-51, score-0.062]
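
Substituting this map back into the inner product spells out why a Mercer kernel fulfils condition (2):

\[
\langle \phi_k(x), \phi_k(x') \rangle_{\mathcal{K}}
  \;=\; \sum_{i=1}^{\infty} \sqrt{\lambda_i}\,\psi_i(x)\,\sqrt{\lambda_i}\,\psi_i(x')
  \;=\; \sum_{i=1}^{\infty} \lambda_i\,\psi_i(x)\,\psi_i(x')
  \;=\; k(x, x').
\]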

26 Finally, the perceptron learning algorithm we are going to consider is described in the following definition. [sent-52, score-0.369]

27 The perceptron learning procedure with the fixed learning rate η ∈ ℝ_+ is as follows: 1. [sent-54, score-0.472]
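
The enumerated steps of Definition 1 were not recovered above, so the following is only a minimal sketch of the standard dual (kernel) perceptron they describe: start from α = 0 and, whenever a training point (x_i, y_i) is misclassified, add η y_i to α_i (the update referred to as rule (4) further down). The function and variable names are ours, not the paper's.

import numpy as np

def kernel_perceptron(K, y, eta=1.0, max_epochs=100):
    """Dual (kernel) perceptron.

    K   : (m, m) kernel matrix with K[i, j] = k(x_i, x_j)
    y   : (m,) labels in {-1, +1}
    eta : fixed learning rate (eta = 1 gives integer-valued alphas,
          the case discussed in Section 3)
    Returns the expansion coefficients alpha, so that w_alpha = sum_i alpha_i phi(x_i).
    """
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            # real-valued output <w_alpha, phi(x_i)> = sum_j alpha_j k(x_j, x_i)
            if y[i] * (alpha @ K[:, i]) <= 0:
                alpha[i] += eta * y[i]   # only one coefficient changes per mistake
                mistakes += 1
        if mistakes == 0:                # consistent with the training set
            break
    return alpha

Because each update touches a single coefficient, the number of non-zero entries of alpha is bounded by the number of mistakes, which is exactly the quantity controlled by Theorem 2 below.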

28 Other variants of this algorithm have been presented elsewhere (see [2, 3]). [sent-72, score-0.031]

29 3 An Improvement of Novikoff's Theorem In the early 1960s Novikoff [10] was able to give an upper bound on the number of mistakes made by the classical perceptron learning procedure. [sent-73, score-0.82]

30 Two years later, this bound was generalised to feature spaces using Mercer kernels by Aizerman et al. [sent-74, score-0.274]

31 The quantity determining the upper bound is the maximally achievable unnormalised margin max_{α ∈ ℝ^m} [sent-76, score-0.849]

32 γ_Z(α), normalised by the total extent R(X) of the data in feature space, i.e. [sent-77, score-0.12]

33 Given a training set Z = (X, Y) and a vector α ∈ ℝ^m, the unnormalised margin γ_Z(α) is given by γ_Z(α) [sent-81, score-0.553]

34 = min_{(x_i, y_i) ∈ Z} y_i ⟨w_α, x_i⟩_K / ‖w_α‖_K. Theorem 2 (Novikoff's Perceptron Convergence Theorem [10, 1]). [sent-82, score-0.178]

35 Suppose that there exists a vector α* ∈ ℝ^m such that γ_Z(α*) > 0. [sent-84, score-0.107]

36 Then the number of mistakes made by the perceptron algorithm in Definition 1 on Z is at most (R(X) / γ_Z(α*))². Surprisingly, this bound is highly influenced by the data point x_i ∈ X with the largest norm ‖x_i‖, although rescaling a data point would not change its classification. [sent-85, score-0.867]
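
In display form, the unnormalised margin of the definition above and the mistake bound of Theorem 2 read:

\[
\gamma_Z(\alpha) \;=\; \min_{(x_i, y_i) \in Z} \frac{y_i \,\langle w_\alpha, x_i \rangle_{\mathcal{K}}}{\lVert w_\alpha \rVert_{\mathcal{K}}},
\qquad
\#\{\text{mistakes on } Z\} \;\le\; \left( \frac{R(X)}{\gamma_Z(\alpha^*)} \right)^{2}.
\]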

37 Let us consider rescaling of the training set X before applying the perceptron algorithm. [sent-86, score-0.483]

38 Then for the normalised training set we would have R(X_norm) = 1, and γ_Z(α) would change into the normalised margin Γ_Z(α) first advocated in [6]. [sent-87, score-0.79]

39 Given a training set Z = (X, Y) and a vector α ∈ ℝ^m, the normalised margin Γ_Z(α) is given by Γ_Z(α) = min_{(x_i, y_i) ∈ Z} y_i ⟨w_α, x_i⟩_K / (‖w_α‖_K ‖x_i‖_K). [sent-89, score-0.687]

40 Hence for any α ∈ ℝ^m and all (x_i, y_i) ∈ Z such that y_i ⟨w_α, x_i⟩_K > 0 we have R(X) · y_i ⟨w_α, x_i⟩_K / (‖w_α‖_K ‖x_i‖_K) ≥ y_i ⟨w_α, x_i⟩_K / ‖w_α‖_K, which immediately implies for all Z = (X, Y) ∈ Z^m such that γ_Z(α) > 0 that R(X) / [sent-92, score-0.04]

41 γ_Z(α) ≥ 1 / Γ_Z(α). (5) Thus when normalising the data in feature space, i.e. [sent-93, score-0.158]

42 using k_norm(x, x') = k(x, x') / √(k(x, x) k(x', x')), the upper bound on the number of steps until convergence of the classical perceptron learning procedure of Rosenblatt [11] is provably decreasing and is given by the squared r.h.s. of (5). [sent-99, score-0.776]
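
A minimal sketch of the feature-space normalisation just described (the helper name is ours): dividing each kernel entry by the norms of its arguments gives R(X_norm) = 1, so the convergence and sparsity bound becomes 1/Γ_Z(α*)², the squared r.h.s. of (5).

import numpy as np

def normalise_kernel(K):
    """Normalise a kernel matrix so that every point has unit norm in feature space.

    K_norm[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]),
    i.e. k_norm(x, x') = k(x, x') / sqrt(k(x, x) * k(x', x')).
    """
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

Running the dual perceptron of Section 2 on normalise_kernel(K) instead of K is exactly the rescaling argument made above.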

43 Considering the form of the update rule (4), we observe that this result not only bounds the number of mistakes made during learning but also the number ‖α‖_0 of non-zero coefficients in the α vector. [sent-102, score-0.233]

44 To be precise, for η = 1 it bounds the ℓ_1 norm ‖α‖_1 of the coefficient vector α which, in turn, bounds the zero norm ‖α‖_0 from above for all vectors with integer components. [sent-103, score-0.32]

45 Theorem 2 thus establishes a relation between the existence of a large margin classifier w* and the sparseness of any solution found by the perceptron algorithm. [sent-104, score-1.102]

46 4 Main Result In order to exploit the guaranteed sparseness of the solution of a kernel perceptron, we make use of the following lemma, to be found in [8, 4]. [sent-105, score-0.713]

47 For any measure P_Z, the probability that m examples Z drawn iid according to P_Z will yield a classifier α(Z) learned by the perceptron algorithm with ‖α(Z)‖_0 = d whose generalisation error P_XY[Y ⟨w_{α(Z)}, φ(X)⟩ ≤ 0] is greater than ε(m, d, δ) is at most δ, where ε(m, d, δ) = (d ln(em/d) + ln(m) + ln(1/δ)) / (m − d). [sent-111, score-0.955]

48 The most intriguing feature of this result is that the mere existence of a large margin classifier α* is sufficient to guarantee a small generalisation error for the solution α of the perceptron, although its attained margin γ_Z(α) is likely to be much smaller than γ_Z(α*). [sent-115, score-1.945]
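
As a small numerical illustration of how sparsity drives this result, the bound can be evaluated directly. The snippet below uses the generic compression-bound form ε(m, d, δ) = (d ln(em/d) + ln(m) + ln(1/δ)) / (m − d) stated above; the concrete values of d and δ are illustrative assumptions, not numbers from the paper.

import math

def compression_bound(m, d, delta):
    """Generic compression-style bound for a classifier that is consistent on
    m examples and whose expansion uses only d of them (d non-zero alphas).

    Assumed form: (d * ln(e*m/d) + ln(m) + ln(1/delta)) / (m - d).
    """
    assert 0 < d < m
    return (d * math.log(math.e * m / d) + math.log(m) + math.log(1.0 / delta)) / (m - d)

# Illustration only: m = 60000 training points, a solution with d = 1000
# non-zero coefficients and delta = 0.05 give a bound of roughly 0.087 (8.7%).
print(compression_bound(60000, 1000, 0.05))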

49 It has long been argued that the attained margin γ_Z(α) itself is the crucial quantity controlling the generalisation error of α. [sent-116, score-0.875]

50 In light of our new result, if there exists a consistent classifier α* with large margin, we know that there also exists at least one classifier α with high sparsity that can efficiently be found using the perceptron algorithm. [sent-117, score-1.361]

51 In fact, whenever the SVM appears to be theoretically justified by a large observed margin, every solution found by the perceptron algorithm has a small guaranteed generalisation error - mostly even smaller than current bounds on the generalisation error of SVMs. [sent-118, score-1.364]

52 Note that for a given training sample Z it is not unlikely that by permutation of Z there exist [sent-119, score-0.048]

53 5 Impact on the Foundations of Support Vector Machines Support vector machines owe their popularity mainly to their theoretical justification in learning theory. [sent-121, score-0.171]

54 In particular, two arguments have been put forward to single out the solutions found by SVMs [14, p. [sent-122, score-0.108]

55 The second reason is often justified by margin results (see [14, 12]) which bound the generalisation error of a classifier α in terms of its own attained margin γ_Z(α). [sent-127, score-1.549]

56 If we require the slightly stronger condition that κ* ≤ m/n, n ≥ 4, then our bound (8) for solutions of perceptron learning can be upper bounded by an expression of the compression type, essentially (2/m)(κ* ln(em/κ*) + ln(m) + ln(1/δ)), which has to be compared with the PAC margin bound (see [12, 5]), essentially (2/m)(64 κ* log₂(em/κ*) log₂(32m) + log₂(m/δ)). [sent-128, score-1.247]

57 Table 1: Results of kernel perceptrons and SVMs on NIST (taken from [2, Table 3]). [sent-169, score-0.143]

58 The kernel used was k(x, x') = (⟨x, x'⟩ + 1)^4 and m = 60000. [sent-170, score-0.11]
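
The quoted kernel is a fourth-degree polynomial kernel and can be dropped directly into the dual perceptron sketch from Section 2; a minimal definition (names ours):

import numpy as np

def poly4_kernel(X1, X2):
    """k(x, x') = (<x, x'> + 1)**4, the kernel quoted for the NIST experiments."""
    return (X1 @ X2.T + 1.0) ** 4

# e.g. K = poly4_kernel(X, X) for an (m, n) array X of digit images,
# followed by alpha = kernel_perceptron(K, y) for a one-vs-rest task.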

59 For both algorithms we give the measured generalisation error (in %), the attained sparsity and the bound value (in %, δ = 0. [sent-171, score-0.872]

60 • for any m and almost any δ the margin bound given in Theorem 4 guarantees a smaller generalisation error. [sent-173, score-0.974]

61 153]) in the NIST handwritten digit recognition task and inserting this value into the PAC margin bound, one would need the astronomically large number of m > 410 743 386 training examples to obtain a bound value of 0. [sent-175, score-0.677]

62 112 as obtained by (3) for the digit "0" (see Table 1). [sent-176, score-0.055]

63 With regard to the first reason, it has been confirmed experimentally that SVMs find solutions which are sparse in the expansion coefficients α. [sent-177, score-0.115]

64 However, there cannot exist any distribution-free guarantee that the number of support vectors will in fact be small². [sent-178, score-0.128]

65 In contrast, Theorem 2 gives an explicit bound on the sparsity in terms of the achievable margin γ_Z(α*). [sent-179, score-0.848]

66 Furthermore, experimental results on the NIST datasets show that the sparsity of the solution found by the perceptron algorithm is consistently (and often by a factor of two) greater than that of the SVM solution (see [2, Table 3] and Table 1). [sent-180, score-0.765]

67 This result implies that the SVM solution is not at all singled out as being superior in terms of provable generalisation error. [sent-182, score-0.39]

68 Also, the result indicates that sparsity of the solution may be a more fundamental property than the size of the attained margin (since a large value of the latter implies a large value of the former). [sent-183, score-0.767]

69 Our analysis raises an interesting question: having chosen a good kernel, corresponding to a metric in which inter-class distances are large and intra-class distances are small, to what extent does it matter which consistent classifier we use? [sent-184, score-0.254]

70 Experimental ... ² Consider a distribution P_XY on two parallel lines with support in the unit ball. [sent-185, score-0.07]

71 Then the number of support vectors equals the training set size whereas the perceptron algorithm never uses more than two points by Theorem 2. [sent-189, score-0.576]

72 One could argue that it is the number of essential support vectors [13] that characterises the data compression of an SVM (which would also have been two in our example). [sent-190, score-0.207]

73 Their determination, however, involves a combinatorial optimisation problem and can thus never be performed in practical applications. [sent-191, score-0.029]

74 results seem to indicate that a vast variety of heuristics for finding consistent classifiers, e.g. [sent-192, score-0.057]

75 kernel Fisher discriminant, linear programming machines, Bayes point machines, kernel PCA & linear SVM, sparse greedy matrix approximation, perform comparably (see http://www. [sent-194, score-0.275]

76 They would like to thank Peter Bartlett and Jon Baxter for many interesting discussions. [sent-198, score-0.031]

77 Furthermore, we would like to thank the anonymous reviewer, Olivier Bousquet and Matthias Seeger for very useful remarks on the paper. [sent-199, score-0.031]

78 Theoretical foundations of the potential function method in pattern recognition learning. [sent-204, score-0.035]

79 The Kernel-Adatron: A fast and simple learning procedure for Support Vector Machines. [sent-216, score-0.037]

80 A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. [sent-233, score-0.591]

81 Functions of positive and negative type and their connection with the theory of integral equations. [sent-246, score-0.027]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('perceptron', 0.398), ('margin', 0.375), ('generalisation', 0.302), ('bound', 0.216), ('sparsity', 0.193), ('percept', 0.164), ('ron', 0.137), ('classifier', 0.135), ('rz', 0.127), ('classifiers', 0.123), ('dx', 0.123), ('attained', 0.111), ('kernel', 0.11), ('novikoff', 0.109), ('compression', 0.106), ('theorem', 0.101), ('irm', 0.099), ('pac', 0.097), ('nist', 0.095), ('sym', 0.095), ('mercer', 0.092), ('mistakes', 0.092), ('normalised', 0.089), ('dual', 0.087), ('yi', 0.083), ('unnormalised', 0.082), ('bounds', 0.074), ('svm', 0.073), ('wo', 0.071), ('support', 0.07), ('herbrich', 0.064), ('achievable', 0.064), ('ix', 0.063), ('iixillj', 0.063), ('pxy', 0.063), ('reinterpreting', 0.063), ('syms', 0.063), ('existence', 0.062), ('norm', 0.062), ('svms', 0.06), ('exists', 0.059), ('guarantee', 0.058), ('consistent', 0.057), ('digit', 0.055), ('sparse', 0.055), ('xi', 0.054), ('error', 0.05), ('table', 0.05), ('berlin', 0.049), ('littlestone', 0.049), ('aizerman', 0.049), ('solution', 0.048), ('vector', 0.048), ('convergence', 0.048), ('ln', 0.048), ('williamson', 0.048), ('training', 0.048), ('found', 0.047), ('lemma', 0.046), ('graepel', 0.046), ('popularity', 0.046), ('mistake', 0.043), ('definition', 0.041), ('upper', 0.041), ('ez', 0.041), ('xy', 0.041), ('mm', 0.041), ('implies', 0.04), ('machines', 0.04), ('iid', 0.039), ('quantity', 0.037), ('learning', 0.037), ('sparseness', 0.037), ('rescaling', 0.037), ('lc', 0.037), ('ym', 0.037), ('classical', 0.036), ('foundations', 0.035), ('justified', 0.035), ('maximally', 0.034), ('valued', 0.034), ('perceptrons', 0.033), ('former', 0.032), ('feature', 0.031), ('bounded', 0.031), ('would', 0.031), ('arguments', 0.031), ('distances', 0.031), ('guarantees', 0.031), ('algorithm', 0.031), ('coefficients', 0.03), ('solutions', 0.03), ('bartlett', 0.03), ('eigenvalue', 0.03), ('never', 0.029), ('see', 0.028), ('guaranteed', 0.027), ('spaces', 0.027), ('theory', 0.027), ('friess', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

2 0.45475784 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.41141436 75 nips-2000-Large Scale Bayes Point Machines

Author: Ralf Herbrich, Thore Graepel

Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m²) of memory and O(N · m²) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1

4 0.26403606 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 − α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

5 0.2182288 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

6 0.20312481 111 nips-2000-Regularized Winnow Methods

7 0.19705187 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

8 0.1842722 133 nips-2000-The Kernel Gibbs Sampler

9 0.1457486 21 nips-2000-Algorithmic Stability and Generalization Performance

10 0.1395741 130 nips-2000-Text Classification using String Kernels

11 0.13952592 54 nips-2000-Feature Selection for SVMs

12 0.1339345 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

13 0.13143611 13 nips-2000-A Tighter Bound for Graphical Models

14 0.1261839 18 nips-2000-Active Support Vector Machine Classification

15 0.11647652 134 nips-2000-The Kernel Trick for Distances

16 0.11603813 44 nips-2000-Efficient Learning of Linear Perceptrons

17 0.11506929 4 nips-2000-A Linear Programming Approach to Novelty Detection

18 0.11135891 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

19 0.10724527 120 nips-2000-Sparse Greedy Gaussian Process Regression

20 0.10500421 110 nips-2000-Regularization with Dot-Product Kernels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.325), (1, 0.455), (2, -0.252), (3, 0.023), (4, -0.121), (5, -0.251), (6, 0.058), (7, -0.006), (8, 0.064), (9, -0.065), (10, 0.129), (11, -0.106), (12, -0.125), (13, 0.021), (14, -0.109), (15, -0.023), (16, -0.058), (17, -0.009), (18, -0.004), (19, 0.094), (20, 0.045), (21, 0.062), (22, -0.147), (23, 0.012), (24, 0.021), (25, 0.031), (26, -0.019), (27, -0.031), (28, 0.006), (29, -0.025), (30, -0.018), (31, -0.022), (32, 0.026), (33, -0.031), (34, -0.02), (35, 0.011), (36, -0.057), (37, 0.017), (38, 0.028), (39, -0.051), (40, -0.011), (41, -0.009), (42, 0.019), (43, 0.031), (44, 0.054), (45, 0.01), (46, 0.013), (47, 0.003), (48, -0.025), (49, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97148043 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

2 0.90742934 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.79527575 75 nips-2000-Large Scale Bayes Point Machines

Author: Ralf Herbrich, Thore Graepel

Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m²) of memory and O(N · m²) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1

4 0.71526414 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 − α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

5 0.65154964 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

6 0.60041147 111 nips-2000-Regularized Winnow Methods

7 0.59335619 133 nips-2000-The Kernel Gibbs Sampler

8 0.48625258 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

9 0.44404778 44 nips-2000-Efficient Learning of Linear Perceptrons

10 0.43350452 21 nips-2000-Algorithmic Stability and Generalization Performance

11 0.41042182 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

12 0.40911222 54 nips-2000-Feature Selection for SVMs

13 0.40340915 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

14 0.3858324 18 nips-2000-Active Support Vector Machine Classification

15 0.36687499 130 nips-2000-Text Classification using String Kernels

16 0.34852064 13 nips-2000-A Tighter Bound for Graphical Models

17 0.34795353 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

18 0.34767115 138 nips-2000-The Use of Classifiers in Sequential Inference

19 0.33845443 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

20 0.31643277 4 nips-2000-A Linear Programming Approach to Novelty Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.04), (17, 0.094), (32, 0.018), (33, 0.472), (45, 0.039), (55, 0.012), (62, 0.024), (65, 0.012), (67, 0.045), (75, 0.011), (76, 0.057), (79, 0.014), (81, 0.012), (90, 0.046), (97, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95261103 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

2 0.94845086 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

Author: Lucas C. Parra, Clay Spence, Paul Sajda

Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1

3 0.94253582 148 nips-2000-`N-Body' Problems in Statistical Learning

Author: Alexander G. Gray, Andrew W. Moore

Abstract: We present efficient algorithms for all-point-pairs problems , or 'Nbody '-like problems , which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection , and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N 2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric t echniques which are applicable in principle to any 'N-body ' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM 's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1

4 0.62452012 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

5 0.61680734 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 − α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

6 0.55658591 75 nips-2000-Large Scale Bayes Point Machines

7 0.54425561 37 nips-2000-Convergence of Large Margin Separable Linear Classification

8 0.53618854 131 nips-2000-The Early Word Catches the Weights

9 0.51804245 21 nips-2000-Algorithmic Stability and Generalization Performance

10 0.51674861 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

11 0.51006687 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

12 0.50973755 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

13 0.50656313 111 nips-2000-Regularized Winnow Methods

14 0.50423658 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

15 0.50377923 94 nips-2000-On Reversing Jensen's Inequality

16 0.48996851 36 nips-2000-Constrained Independent Component Analysis

17 0.4819141 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping

18 0.47682479 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

19 0.47502556 44 nips-2000-Efficient Learning of Linear Perceptrons

20 0.47271821 74 nips-2000-Kernel Expansions with Unlabeled Examples