nips nips2000 nips2000-7 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1) X² / (α² γ²)) corrections to separate the data with p-norm margin larger than (1 − α) γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform considerably better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is much faster and easier to implement than standard SVM training algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. [sent-3, score-0.736]
2 Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. [sent-7, score-0.081]
3 norm p), takes O((p − 1) X² / (α² γ²)) corrections to separate the data with p-norm margin larger than (1 − α) γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. [sent-10, score-0.896]
4 Our algorithm seems to perform considerably better than both. [sent-14, score-0.151]
5 The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). [sent-15, score-0.071]
6 On the other hand, ALMAp is much faster and easier to implement than standard SVM training algorithms. [sent-16, score-0.133]
7 1 Introduction A great deal of effort has been devoted in recent years to the study of maximal margin classifiers. [sent-17, score-0.431]
8 In this paper we focus on special maximal margin classifiers, i.e. [sent-19, score-0.431]
9 Briefly, given a set of linearly separable data, the maximal margin hyperplane classifies all the data correctly and maximizes the minimal distance between the data and the hyperplane. [sent-22, score-0.785]
10 If the Euclidean norm is used to measure the distance then computing the maximal margin hyperplane corresponds to the, by now classical, Support Vector Machines (SVMs) training problem [3]. [sent-23, score-0.713]
11 This task is naturally formulated as a quadratic programming problem. [sent-24, score-0.042]
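For concreteness, a minimal sketch of that quadratic program, written for hyperplanes through the origin (as in the rest of this paper) and in standard textbook form rather than quoted from the paper:

```latex
\min_{w \in \mathbb{R}^n} \ \frac{1}{2}\,\lVert w\rVert_2^2
\quad \text{subject to} \quad y_t\,(w \cdot x_t) \ge 1, \qquad t = 1, \dots, T .
```

The maximal margin hyperplane is then given by the optimal w rescaled to unit norm; replacing the Euclidean norm by an arbitrary p-norm turns this into a more general convex program, which is exactly the kind of optimization ALMAp is designed to avoid solving directly.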
12 If an arbitrary norm p is used then such a task turns into a more general mathematical programming problem (see, e.g. [sent-25, score-0.131]
13 A major theme of this paper is to devise simple and efficient algorithms to solve the maximal margin hyperplane problem. [sent-29, score-0.637]
14 The first contribution is a new efficient algorithm which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2. [sent-31, score-0.705]
15 We call this algorithm ALMAp (Approximate Large Margin algorithm w.r.t. norm p). [sent-35, score-0.162]
16 ALMAp is naturally viewed as an on-line algorithm, i.e., as an algorithm which processes the examples one at a time. [sent-41, score-0.081]
17 In this sense, ALMAp is a refinement of the on-line algorithms recently introduced in [2]. [sent-43, score-0.045]
18 ALMA2 (i.e., ALMAp with p = 2) is a perceptron-like algorithm; the operations it performs can be expressed as dot products, so that we can replace them by kernel function evaluations. [sent-46, score-0.076]
19 As far as theoretical performance is concerned, ALMA2 achieves essentially the same bound on the number of corrections as the one obtained by a version of Li and Long's ROMMA algorithm [12], though the two algorithms are different. [sent-48, score-0.432]
20 In the case that p is logarithmic in the dimension of the instance space (as in [6]) ALMAp yields results which are similar to those obtained by estimators based on linear programming (see [1, Chapter 14]). [sent-49, score-0.088]
21 The second contribution of this paper is an experimental investigation of ALMA2 on the problem of handwritten digit recognition. [sent-50, score-0.114]
22 We ran ALMA2 with polynomial kernels, using both the last and the voted hypotheses (as in [4]), and we compared our results to those described in [3, 4, 12]. [sent-52, score-0.347]
23 We found that voted ALMA2 generalizes considerably better than both ROMMA and the voted Perceptron algorithm, but slightly worse than standard SVMs. [sent-53, score-0.587]
24 On the other hand, ALMA2 is much faster and easier to implement than standard SVM training algorithms. [sent-54, score-0.096]
25 In Section 3 we describe ALMAp and claim its theoretical properties. [sent-58, score-0.066]
26 2 Preliminaries and notation An example is a pair (x, y), where x is an instance belonging to R^n and y ∈ {−1, +1} is the binary label associated with x. [sent-61, score-0.126]
27 A vector w = (w_1, ..., w_n) ∈ R^n represents an n-dimensional hyperplane passing through the origin. [sent-65, score-0.308]
28 We use p-norms for instances and q-norms for weight vectors, where q = p/(p − 1) is the dual exponent of p (so that 1/p + 1/q = 1). [sent-75, score-0.047]
29 The (normalized) p-norm margin (or just the margin, if p is clear from the surrounding context) of a hyperplane w with ||w||_q ≤ 1 on example (x, y) is defined as y w·x / ||x||_p. If this margin is positive then w classifies (x, y) correctly. [sent-76, score-0.828]
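A small Python helper, a sketch only, spelling out this margin quantity exactly as reconstructed above (the function name and NumPy usage are mine, not from the paper):

```python
import numpy as np

def pnorm_margin(w, x, y, p):
    """Normalized p-norm margin y * (w . x) / ||x||_p of hyperplane w on (x, y).

    Assumes w has already been scaled so that its dual norm ||w||_q <= 1,
    with q = p / (p - 1)."""
    return y * np.dot(w, x) / np.linalg.norm(x, ord=p)
```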
30 Our goal is to approximate the maximal p-norm margin hyperplane for a set of examples (the training set). [sent-79, score-0.659]
31 For this purpose, we use terminology and analytical tools from the on-line learning literature. (Footnote 1: In fact, algorithms such as ROMMA and the one contained in Kowalczyk [10] have been specifically designed for the Euclidean norm. [sent-80, score-0.045]
32 Any straightforward extension of these algorithms to a general norm p seems to require numerical methods.) [sent-81, score-0.167]
33 An on-line learning algorithm processes the examples one at a time in trials. [sent-86, score-0.081]
34 In each trial, the algorithm observes an instance x and is required to predict the label y associated with x. [sent-87, score-0.175]
35 The prediction ŷ combines the current instance x with the current internal state of the algorithm. [sent-89, score-0.111]
36 In our case this state is essentially a weight vector w, representing the algorithm's current hypothesis about the maximal margin hyperplane. [sent-90, score-0.574]
37 After the prediction is made, the true value of y is revealed and the algorithm suffers a loss, measuring the "distance" between the prediction ŷ and the label y. [sent-91, score-0.22]
38 In our case the prediction is ŷ = sign(w·x) and the loss is a margin-based 0-1 loss: the loss of w on example (x, y) is 1 if y w·x / ||x||_p ≤ (1 − α) γ and 0 otherwise, for suitably chosen α, γ ∈ [0, 1]. [sent-94, score-0.162]
39 Therefore, if ||w||_q ≤ 1 then the algorithm incurs positive loss if and only if w classifies (x, y) with (p-norm) margin not larger than (1 − α) γ. [sent-95, score-0.514]
40 The algorithms we consider are loss driven, i.e., they update their internal state only in those trials where they suffer a positive loss (we call such trials corrections). [sent-98, score-0.107]
41 In the special case when α = 1 a correction is a mistaken trial and a loss-driven algorithm turns into a mistake-driven [14] algorithm. [sent-100, score-0.378]
42 Throughout the paper we use the subscript t for x and y to denote the instance and the label processed in trial t. [sent-101, score-0.179]
43 We use the subscript k for those variables, such as the algorithm's weight vector w, which are updated only within a correction. [sent-102, score-0.078]
44 In particular, w_k denotes the algorithm's weight vector after k − 1 corrections (so that w_1 is the initial weight vector). [sent-103, score-0.314]
45 The goal of the on-line algorithm is to bound the cumulative loss (i.e. [sent-104, score-0.222]
46 , the total number of corrections or mistakes) it suffers on an arbitrary sequence of examples S = ((x_1, y_1), . [sent-106, score-0.216]
47 If S is linearly separable with margin γ and we pick α < 1 then a bounded loss clearly implies convergence in a finite number of steps to (an approximation of) the maximal margin hyperplane for S. [sent-110, score-1.074]
48 3 The approximate large margin algorithm ALMAp ALMAp is a large margin variant of the p-norm Perceptron algorithm [8, 6], and is similar in spirit to the variable learning rate algorithms introduced in [2]. [sent-111, score-0.802]
49 Part 1 bounds the number of corrections in the linearly separable case only. [sent-115, score-0.308]
50 In the special case when p = 2 this bound is very similar to the one proven by Li and Long for a version of ROMMA [12]. [sent-116, score-0.1]
51 A bound which is very close to the one proven in [8, 6] for the (constant learning rate) p-norm Perceptron algorithm is obtained as a special case. [sent-118, score-0.181]
52 Despite this theoretical similarity, the experiments we report in Section 4 show that using our margin-sensitive variable learning rate algorithm yields a clear increase in performance. [sent-119, score-0.145]
53 In order to define our algorithm, we need to recall the following mapping f from [6] (a p-indexing for f is understood): f : R^n → R^n, f = (f_1, . [sent-120, score-0.147]
54 The (unique) inverse f^{-1} of f is [6] f^{-1} : R^n → R^n, f^{-1} = (f_1^{-1}, . [sent-128, score-0.147]
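The component formulas of f and f^{-1} are cut off in this extraction. For reference, the p-norm link function used in [6, 8] has the following form; this is stated here as an assumption about the elided definition, not quoted from the paper:

```latex
f_i(w) \;=\; \frac{\operatorname{sign}(w_i)\,\lvert w_i\rvert^{\,q-1}}{\lVert w\rVert_q^{\,q-2}},
\qquad
f_i^{-1}(\theta) \;=\; \frac{\operatorname{sign}(\theta_i)\,\lvert \theta_i\rvert^{\,p-1}}{\lVert \theta\rVert_p^{\,p-2}},
\qquad \frac{1}{p}+\frac{1}{q}=1 .
```

With this form, p = 2 gives q = 2 and f = f^{-1} = identity, which is exactly the property used later when kernelizing ALMA2.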
55 (Footnote 3: The p-norm Perceptron algorithm is a generalization of the classical Perceptron algorithm [18]: the p-norm Perceptron is actually the Perceptron when p = 2.) [sent-135, score-0.197]
56 Initialization: initial weight vector w_1 = 0; k = 1. [sent-137, score-0.08]
57 For t = 1, ..., T do: Get example (x_t, y_t) ∈ R^n × {−1, +1} and update weights as follows: Set γ_k = B √(p − 1) / √k. If y_t w_k·x_t / ||x_t||_p ≤ (1 − α) γ_k then: η_k = C / (√(p − 1) ||x_t||_p √k), w'_k = f^{-1}(f(w_k) + η_k y_t x_t), w_{k+1} = w'_k / max{1, ||w'_k||_q}, k ← k + 1. [sent-141, score-0.184]
58 Figure 1: The approximate large margin algorithm ALMAp. [sent-142, score-0.396]
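A runnable Python sketch of the algorithm in Figure 1, as reconstructed above. It is an illustration under stated assumptions, not the authors' code: the link function f uses the assumed form given earlier, and B and C are left as explicit arguments because their theoretically prescribed values (Theorem 1) are not reproduced in this extraction.

```python
import numpy as np

def alma_p(X, Y, p, alpha, B, C, epochs=1):
    """Sketch of ALMA_p (Figure 1). X: (T, n) array of instances, Y: labels in {-1, +1}."""
    q = p / (p - 1.0)                      # dual exponent, 1/p + 1/q = 1

    def f(w):                              # assumed p-norm link function
        nrm = np.linalg.norm(w, ord=q)
        if nrm == 0.0:
            return np.zeros_like(w)
        return np.sign(w) * np.abs(w) ** (q - 1) / nrm ** (q - 2)

    def f_inv(theta):                      # its inverse
        nrm = np.linalg.norm(theta, ord=p)
        if nrm == 0.0:
            return np.zeros_like(theta)
        return np.sign(theta) * np.abs(theta) ** (p - 1) / nrm ** (p - 2)

    w = np.zeros(X.shape[1])               # w_1 = 0
    k = 1
    for _ in range(epochs):
        for x, y in zip(X, Y):
            x_p = np.linalg.norm(x, ord=p)
            gamma_k = B * np.sqrt(p - 1) / np.sqrt(k)
            if y * np.dot(w, x) / x_p <= (1 - alpha) * gamma_k:           # correction
                eta_k = C / (np.sqrt(p - 1) * x_p * np.sqrt(k))
                w_prime = f_inv(f(w) + eta_k * y * x)
                w = w_prime / max(1.0, np.linalg.norm(w_prime, ord=q))    # projection onto W
                k += 1
    return w
```

For p = 2, f and f_inv reduce to the identity and the loop is simply a margin-sensitive, variable-learning-rate perceptron.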
59 The algorithm is parameterized by α ∈ (0, 1], B > 0 and C > 0. [sent-144, score-0.081]
60 The parameter α measures the degree of approximation to the optimal margin hyperplane, while B and C might be considered as tuning parameters. [sent-145, score-0.28]
61 At time t the algorithm processes the example (x_t, y_t). [sent-150, score-0.081]
62 If the current weight vector w_k classifies (x_t, y_t) with margin not larger than (1 − α) γ_k then a correction occurs. [sent-151, score-0.482]
63 The first step gives w'_k through the classical update of a (p-norm) perceptron-like algorithm (notice, however, that the learning rate η_k scales with k, the number of corrections that have occurred so far). [sent-153, score-0.34]
64 The projection step makes the new weight vector w_{k+1} belong to W. [sent-155, score-0.08]
65 Here we claim that a special choice of the parameters B and C gives rise to an algorithm which approximates the maximal margin hyperplane to any given accuracy α. [sent-158, score-0.742]
66 In part 2 we claim that if a suitable relationship between the parameters B and C is satisfied then a bound on the number of corrections can be proven in the general (nonseparable) case. [sent-159, score-0.363]
67 The bound of part 2 is in terms of the margin-based quantity D_γ(u; (x, y)) = max{0, γ − y u·x / ||x||_p}, γ > 0. [sent-160, score-0.099]
68 D_γ is called deviation in [4] and linear hinge loss in [7]. [sent-162, score-0.127]
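A one-line Python rendering of this quantity, matching the reconstruction above (the name `deviation` and the argument layout are mine):

```python
import numpy as np

def deviation(u, x, y, gamma, p):
    """D_gamma(u; (x, y)) = max{0, gamma - y * (u . x) / ||x||_p}: how far u falls
    short of margin gamma on (x, y); zero when the target margin is met."""
    return max(0.0, gamma - y * np.dot(u, x) / np.linalg.norm(x, ord=p))
```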
69 Notice that B and C in part 1 do not meet the requirements given in part 2. [sent-163, score-0.078]
70 On the other hand, in the separable case B and C chosen in part 2 do not yield a hyperplane which is arbitrarily (up to a small α) close to the maximal margin hyperplane. [sent-164, score-0.719]
71 Theorem 1 Let W = {w ∈ R^n : ||w||_q ≤ 1}, S = ((x_1, y_1), . [sent-165, score-0.147]
72 ..., (x_T, y_T)) ∈ (R^n × {−1, +1})^T, and M be the set of corrections of ALMAp(α; B, C) running on S (i.e. [sent-166, score-0.263]
73 Hence (1) is also an upper bound on the number of trials t in which the (p-norm) margin on (x_t, y_t) is at most (1 − α) γ*. [sent-178, score-0.096]
74 Then for any u ∈ W, ALMAp(α; B, C) achieves the following bound on |M|, holding for any γ > 0. Observe that when α = 1 the above inequality turns into a bound on the number of mistaken trials. [sent-181, score-0.16]
75 When p = 2 the computations performed by ALMAp essentially involve only dot products (recall that p = 2 yields q = 2 and f = f^{-1} = identity). [sent-183, score-0.066]
76 Thus the generalization of ALMA2 to the kernel case is quite standard. [sent-184, score-0.077]
77 Here the denominator N_{k+1} equals max{1, ||w'_k||_2} and the norm ||w'_k||_2 is again computed recursively by ||w'_k||_2^2 = ||w'_{k-1}||_2^2 / N_k^2 + 2 η_k y_t w_k·x_t + η_k^2 ||x_t||_2^2, where the dot product w_k·x_t is taken from the k-th correction (the trial where the k-th weight update did occur). [sent-189, score-0.313]
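A sketch of this dual (kernel) bookkeeping for ALMA2. The kernel function, variable names, and the way the expansion coefficients are rescaled after the projection step are illustrative choices of mine, not taken from the paper; B and C are again passed in explicitly.

```python
import numpy as np

def kernel_alma2(X, Y, kernel, alpha, B, C, epochs=1):
    """Dual-form ALMA_2 sketch: the hypothesis w_k is kept implicitly as an
    expansion over corrected examples, and the norm needed for the projection
    step is maintained recursively (p = q = 2, so f is the identity)."""
    sv, coef = [], []            # indices of corrected examples and their coefficients
    norm_w = 0.0                 # ||w_k||_2 of the current (projected) weight vector
    k = 1

    def dot_w(x):                # w_k . x computed through the kernel expansion
        return sum(c * kernel(X[i], x) for i, c in zip(sv, coef))

    for _ in range(epochs):
        for t, (x, y) in enumerate(zip(X, Y)):
            x_norm = np.sqrt(kernel(x, x))            # ||x_t||_2 in feature space
            gamma_k = B / np.sqrt(k)                  # sqrt(p - 1) = 1 for p = 2
            s = dot_w(x)
            if y * s / x_norm <= (1 - alpha) * gamma_k:        # correction
                eta = C / (x_norm * np.sqrt(k))
                # w'_k = w_k + eta * y * x, so its squared norm follows the recursion
                # ||w'_k||^2 = ||w_k||^2 + 2 * eta * y * (w_k . x) + eta^2 * ||x||^2
                norm_w_prime = np.sqrt(norm_w ** 2 + 2 * eta * y * s + (eta * x_norm) ** 2)
                sv.append(t)
                coef.append(eta * y)
                denom = max(1.0, norm_w_prime)        # projection: w_{k+1} = w'_k / max{1, ||w'_k||_2}
                coef = [c / denom for c in coef]
                norm_w = norm_w_prime / denom
                k += 1
    return sv, coef
```

With a linear kernel this reduces to the p = 2 case of the previous sketch; the point of the recursion is that ||w'_k||_2 never has to be recomputed from the full expansion.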
78 4 Experimental results We did some experiments running ALMA2 on the well-known MNIST OCR database. [sent-190, score-0.042]
79 We compared to SVMs, the Perceptron algorithm and the Perceptron-like algorithm ROMMA [12]. [sent-209, score-0.162]
80 In particular, we did not determine the best instance scaling factor s for our algorithm (this corresponds to using the kernel K(x, y) = (1 + x·y / s)^d). [sent-215, score-0.167]
81 As in [4], the output of a binary classifier is based on either the last hypothesis produced by the algorithms (denoted by "last" in Table 1) or Helmbold and Warmuth's [9] leave-one-out voted hypothesis (denoted by "voted"). [sent-222, score-0.49]
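A minimal sketch of the "voted" output referred to here, in the style of the leave-one-out voting scheme of [4, 9]: each intermediate hypothesis votes with weight equal to the number of trials it survived unchanged. The snapshot bookkeeping is assumed to be collected during training and is not part of the paper's pseudocode.

```python
import numpy as np

def voted_predict(x, snapshots, survival_counts):
    """Predict a label for x by weighted vote over the intermediate weight
    vectors ("snapshots") produced during training; survival_counts[i] is the
    number of trials snapshots[i] survived before the next correction."""
    vote = sum(c * np.sign(np.dot(w, x)) for w, c in zip(snapshots, survival_counts))
    return 1 if vote >= 0 else -1
```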
82 We trained the algorithms by cycling up to 3 times ("epochs") over the training set. [sent-224, score-0.077]
83 "Corr's" give the total number of corrections made in the training phase for the 10 labels. [sent-235, score-0.219]
84 Our own experimental results are given in the last six rows. [sent-238, score-0.11]
85 These results also show how accuracy and running time (as well as sparsity) can be traded off against each other in a transparent way. [sent-244, score-0.08]
86 The accuracy of our algorithm is slightly worse than SVMs'. [sent-245, score-0.119]
87 On the other hand, our algorithm is much faster and easier to implement than previous implementations of SVMs, such as those given in [17, 5]. [sent-246, score-0.182]
88 An interesting feature of ALMA2 is that its approximate solution relies on fewer support vectors than the SVM solution. [sent-247, score-0.067]
89 In fact, the algorithm is rather fast: training for one epoch the ten binary classifiers of ALMA2(1. [sent-251, score-0.216]
90 This seems to make little difference for the MNIST dataset, but there are cases when a biased maximal margin hyperplane generalizes considerably better than an unbiased one. [sent-259, score-0.662]
91 "TestErr" denotes the fraction of misclassified patterns in the test set, while "Corr's" gives the total number of training corrections for the 10 labels. [sent-263, score-0.219]
92 Thus the number of corrections of "last" is the same as the number of corrections of "voted". [sent-265, score-0.374]
93 It is not clear to us how to incorporate any noise control mechanism into the classical Perceptron algorithm. [sent-297, score-0.07]
94 (Footnote 9: According to [12], ROMMA's last hypothesis seems to perform better than ROMMA's voted hypothesis.) [sent-299, score-0.413]
95 The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. [sent-325, score-0.141]
96 In Smola, Bartlett, Scholkopf, and Schuurmans, editors, Advances in large margin classifiers, MIT Press, 1999. [sent-354, score-0.347]
97 Vapnik, Comparison of learning algorithms for handwritten digit recognition. [sent-369, score-0.121]
98 From support vector machines to large margin classifiers, PhD Thesis, School of Computing, the National University of Singapore, 2000. [sent-378, score-0.374]
99 Fast training of support vector machines using sequential minimal optimization. [sent-396, score-0.126]
100 In Scholkopf, Burges, and Smola, editors, Advances in kernel methods: support vector machines, MIT Press, 1998. [sent-397, score-0.105]
wordName wordTfidf (topN-words)
[('almap', 0.551), ('margin', 0.28), ('voted', 0.275), ('romma', 0.207), ('wk', 0.199), ('corrections', 0.187), ('perceptron', 0.162), ('hyperplane', 0.161), ('maximal', 0.151), ('nn', 0.147), ('ilw', 0.119), ('gentile', 0.115), ('corr', 0.099), ('svms', 0.097), ('testerr', 0.092), ('norm', 0.089), ('separable', 0.088), ('algorithm', 0.081), ('xt', 0.081), ('loss', 0.081), ('yt', 0.08), ('band', 0.078), ('last', 0.072), ('classifies', 0.072), ('ilwll', 0.072), ('bound', 0.06), ('helmbold', 0.059), ('llq', 0.059), ('wi', 0.057), ('trial', 0.054), ('colt', 0.054), ('littlestone', 0.054), ('epochs', 0.05), ('correction', 0.05), ('label', 0.048), ('weight', 0.047), ('mnist', 0.047), ('instance', 0.046), ('hinge', 0.046), ('ilwllp', 0.046), ('ilwllq', 0.046), ('kowalczyk', 0.046), ('llwiip', 0.046), ('milano', 0.046), ('universita', 0.046), ('algorithms', 0.045), ('running', 0.042), ('programming', 0.042), ('digit', 0.04), ('proven', 0.04), ('classifiers', 0.04), ('kernel', 0.04), ('friess', 0.04), ('llp', 0.04), ('mistaken', 0.04), ('nc', 0.04), ('part', 0.039), ('accuracy', 0.038), ('experimental', 0.038), ('update', 0.037), ('quite', 0.037), ('claim', 0.037), ('table', 0.036), ('dot', 0.036), ('driven', 0.036), ('handwritten', 0.036), ('trials', 0.036), ('adatron', 0.036), ('imi', 0.036), ('lecun', 0.036), ('clear', 0.035), ('approximate', 0.035), ('classical', 0.035), ('li', 0.034), ('internal', 0.034), ('aj', 0.034), ('linearly', 0.033), ('remarks', 0.033), ('seems', 0.033), ('vector', 0.033), ('hypothesis', 0.033), ('implement', 0.033), ('approximates', 0.032), ('support', 0.032), ('training', 0.032), ('binary', 0.032), ('dual', 0.032), ('epoch', 0.031), ('incremental', 0.031), ('remarkable', 0.031), ('subscript', 0.031), ('easier', 0.031), ('notice', 0.031), ('prediction', 0.031), ('theorem', 0.031), ('essentially', 0.03), ('theoretical', 0.029), ('concluding', 0.029), ('suffers', 0.029), ('machines', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1) X² / (α² γ²)) corrections to separate the data with p-norm margin larger than (1 − α) γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform considerably better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is much faster and easier to implement than standard SVM training algorithms.
2 0.26403606 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
3 0.20437849 37 nips-2000-Convergence of Large Margin Separable Linear Classification
Author: Tong Zhang
Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed
4 0.15639651 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
5 0.14739853 35 nips-2000-Computing with Finite and Infinite Networks
Author: Ole Winther
Abstract: Using statistical mechanics results, I calculate learning curves (average generalization error) for Gaussian processes (GPs) and Bayesian neural networks (NNs) used for regression. Applying the results to learning a teacher defined by a two-layer network, I can directly compare GP and Bayesian NN learning. I find that a GP in general requires O(d^s) training examples to learn input features of order s (d is the input dimension), whereas a NN can learn the task with order the number of adjustable weights training examples. Since a GP can be considered as an infinite NN, the results show that even in the Bayesian approach, it is important to limit the complexity of the learning machine. The theoretical findings are confirmed in simulations with analytical GP learning and a NN mean field algorithm.
6 0.14304066 75 nips-2000-Large Scale Bayes Point Machines
7 0.14035615 111 nips-2000-Regularized Winnow Methods
8 0.13709989 54 nips-2000-Feature Selection for SVMs
9 0.11373305 4 nips-2000-A Linear Programming Approach to Novelty Detection
10 0.11153577 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
11 0.10142462 44 nips-2000-Efficient Learning of Linear Perceptrons
12 0.088845581 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
13 0.083648987 133 nips-2000-The Kernel Gibbs Sampler
14 0.081584394 18 nips-2000-Active Support Vector Machine Classification
15 0.081062689 130 nips-2000-Text Classification using String Kernels
16 0.077566549 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
17 0.070021547 122 nips-2000-Sparse Representation for Gaussian Process Models
18 0.06642016 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
19 0.065851107 134 nips-2000-The Kernel Trick for Distances
20 0.064190008 74 nips-2000-Kernel Expansions with Unlabeled Examples
topicId topicWeight
[(0, 0.242), (1, 0.258), (2, -0.138), (3, -0.02), (4, -0.078), (5, -0.131), (6, 0.028), (7, 0.029), (8, 0.033), (9, -0.036), (10, 0.046), (11, -0.032), (12, -0.021), (13, -0.023), (14, 0.096), (15, -0.074), (16, -0.141), (17, 0.052), (18, -0.051), (19, -0.033), (20, 0.188), (21, -0.047), (22, -0.034), (23, 0.158), (24, -0.032), (25, -0.009), (26, -0.017), (27, -0.213), (28, -0.036), (29, -0.011), (30, -0.039), (31, -0.14), (32, -0.013), (33, 0.097), (34, -0.06), (35, 0.05), (36, -0.027), (37, -0.108), (38, -0.119), (39, -0.113), (40, -0.023), (41, -0.013), (42, 0.065), (43, 0.028), (44, 0.084), (45, -0.043), (46, -0.027), (47, -0.093), (48, -0.003), (49, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.94745958 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1) X² / (α² γ²)) corrections to separate the data with p-norm margin larger than (1 − α) γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform considerably better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is much faster and easier to implement than standard SVM training algorithms.
2 0.75808883 37 nips-2000-Convergence of Large Margin Separable Linear Classification
Author: Tong Zhang
Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed
3 0.71605563 111 nips-2000-Regularized Winnow Methods
Author: Tong Zhang
Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.
4 0.66245955 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
5 0.54780394 44 nips-2000-Efficient Learning of Linear Perceptrons
Author: Shai Ben-David, Hans-Ulrich Simon
Abstract: We consider the existence of efficient algorithms for learning the class of half-spaces in R^n in the agnostic learning model (i.e., making no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best agreement half-space over an input sample - is NP hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is μ-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the μ-margins of its separating hyper-plane are disregarded. We prove crisp computational complexity results with respect to this success measure: On one hand, for every positive μ, there exist efficient (poly-time) μ-margin successful learning algorithms. On the other hand, we prove that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in 1/μ that is μ-margin successful for all μ > 0. 1
6 0.46627572 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
7 0.43666667 54 nips-2000-Feature Selection for SVMs
8 0.41081911 35 nips-2000-Computing with Finite and Infinite Networks
9 0.40378085 75 nips-2000-Large Scale Bayes Point Machines
10 0.38738075 18 nips-2000-Active Support Vector Machine Classification
11 0.38162294 4 nips-2000-A Linear Programming Approach to Novelty Detection
12 0.36733678 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
13 0.33754167 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
14 0.28849408 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
15 0.28059119 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
16 0.27949566 122 nips-2000-Sparse Representation for Gaussian Process Models
17 0.27392042 12 nips-2000-A Support Vector Method for Clustering
18 0.27085474 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
19 0.27029982 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
20 0.26012537 21 nips-2000-Algorithmic Stability and Generalization Performance
topicId topicWeight
[(10, 0.05), (17, 0.107), (29, 0.23), (32, 0.023), (33, 0.11), (45, 0.012), (55, 0.017), (62, 0.051), (65, 0.029), (67, 0.039), (75, 0.021), (76, 0.053), (77, 0.015), (79, 0.016), (81, 0.031), (90, 0.07), (97, 0.018), (99, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.81527501 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1) X² / (α² γ²)) corrections to separate the data with p-norm margin larger than (1 − α) γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform considerably better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is much faster and easier to implement than standard SVM training algorithms.
2 0.60434431 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
3 0.58855039 58 nips-2000-From Margin to Sparsity
Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson
Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1
4 0.58406222 148 nips-2000-`N-Body' Problems in Statistical Learning
Author: Alexander G. Gray, Andrew W. Moore
Abstract: We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N^2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMMs. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation. 1
5 0.58159566 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) memory and O(N · m^2) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1
6 0.58001029 37 nips-2000-Convergence of Large Margin Separable Linear Classification
7 0.57811826 74 nips-2000-Kernel Expansions with Unlabeled Examples
8 0.57465273 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
9 0.57378 111 nips-2000-Regularized Winnow Methods
10 0.56546777 4 nips-2000-A Linear Programming Approach to Novelty Detection
11 0.56271446 21 nips-2000-Algorithmic Stability and Generalization Performance
12 0.56228626 52 nips-2000-Fast Training of Support Vector Classifiers
13 0.55888337 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
14 0.55857891 133 nips-2000-The Kernel Gibbs Sampler
15 0.55415684 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
16 0.55300611 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
17 0.55291373 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
18 0.55230933 36 nips-2000-Constrained Independent Component Analysis
19 0.54633218 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
20 0.5455659 130 nips-2000-Text Classification using String Kernels