nips nips2000 nips2000-111 knowledge-graph by maker-knowledge-mining

111 nips-2000-Regularized Winnow Methods


Source: pdf

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. [sent-4, score-0.297]

2 Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. [sent-5, score-0.068]

3 Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. [sent-6, score-0.449]

4 We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. [sent-7, score-0.068]

5 1 Introduction In this paper, we consider the binary classification problem, which is to determine a label $y \in \{-1, 1\}$ associated with an input vector $x$. [sent-10, score-0.082]

6 Specifically, we seek a weight vector $w$ and a threshold $\theta$ such that $w^T x < \theta$ if its label $y = -1$ and $w^T x \ge \theta$ if its label $y = 1$. [sent-12, score-0.255]

7 These algorithms typically fix the threshold $\theta$ and update the weight vector $w$ by going through the training data repeatedly. [sent-18, score-0.347]

8 They are mistake driven in the sense that the weight vector is updated only when the algorithm is not able to correctly classify an example. [sent-19, score-0.293]

9 For the Perceptron algorithm, the update rule is additive: if the linear discriminant function misclassifies an input training vector $x^i$ with true label $y^i$, then we update each component $j$ of the weight vector $w$ as: $w_j \leftarrow w_j + \eta x^i_j y^i$, where $\eta > 0$ is a parameter called the learning rate. [sent-20, score-0.625]

10 The Winnow algorithm belongs to a general family of algorithms called exponentiated gradient descent with unnormalized weights (EGU) [9]. [sent-23, score-0.235]
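
The extracted sentences spell out the additive Perceptron rule (sentence 9) but only name the multiplicative Winnow/EGU family. The Python sketch below contrasts the two mistake-driven updates; the multiplicative form $w_j \leftarrow w_j \exp(\eta x_j y)$ is the standard EGU rule rather than a formula quoted from this paper, the threshold is fixed at zero for simplicity, and the function names and toy usage are illustrative.

```python
import numpy as np

def perceptron_update(w, x, y, eta=1.0):
    """Additive (Perceptron) update on a mistake: w_j <- w_j + eta * x_j * y."""
    if y * (w @ x) <= 0:                 # mistake-driven: only update on a misclassification
        w = w + eta * y * x
    return w

def winnow_egu_update(w, x, y, eta=0.1):
    """Multiplicative (Winnow / EGU) update on a mistake: w_j <- w_j * exp(eta * x_j * y).

    Weights stay positive; negative weights are emulated by augmenting the
    input with -x, as described in sentence 11."""
    if y * (w @ x) <= 0:
        w = w * np.exp(eta * y * x)
    return w

# Toy usage: augment x with -x so positive Winnow weights can act as negative ones.
x = np.array([1.0, 0.0, -1.0])
x_aug = np.concatenate([x, -x])
w_perc = perceptron_update(np.zeros_like(x), x, y=+1)
w_winn = winnow_egu_update(np.ones_like(x_aug), x_aug, y=+1)
```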

11 This modification allows the positive weight Winnow algorithm for the augmented input x to have the effect of both positive and negative weights for the original input x. [sent-26, score-0.129]

12 Another modification is to normalize the one-norm of the weight $w$ so that $\sum_j w_j = W$, leading to the normalized Winnow. [sent-27, score-0.175]

13 Theoretical properties of multiplicative update algorithms have been extensively studied since the introduction of Winnow. [sent-28, score-0.159]

14 For linearly separable binary-classification problems, both Perceptron and Winnow are able to find a weight that separates the in-class vectors from the out-of-class vectors in the training set within a finite number of steps. [sent-29, score-0.289]

15 However, the number of mistakes (updates) before finding a separating hyperplane can be very different [10, 9]. [sent-30, score-0.109]

16 For linearly separable problems, Vapnik proposed a method that optimizes the Perceptron mistake bound, which he calls the "optimal hyperplane" (see [15]). [sent-32, score-0.379]

17 The same method has also appeared in the statistical mechanical learning literature (see [1, 8, 11]), and is referred to as achieving optimal stability. [sent-33, score-0.067]

18 For non-separable problems, a generalization of optimal hyperplane was proposed in [2] by introducing a "soft-margin" loss term. [sent-34, score-0.15]

19 In this paper, we derive regularized Winnow methods by constructing "optimal hyperplanes" that minimize the Winnow mistake bound (rather than the Perceptron mistake bound as in an SVM). [sent-35, score-0.831]

20 We then derive a "soft-margin" version of the algorithms for non-separable problems. [sent-36, score-0.093]

21 For an SVM, a fixed threshold also allows a simple Perceptron-like numerical algorithm as described in chapter 12 of [13], and in [7]. [sent-40, score-0.074]

22 In Section 2, we review mistake bounds for Perceptron and Winnow. [sent-43, score-0.18]

23 Based on the bounds, we show how regularized Winnow methods can be derived by mimicking the optimal stability method (and SVM) for Perceptron. [sent-44, score-0.467]

24 We also discuss the relationship of the newly derived methods with related methods. [sent-45, score-0.098]

25 In Section 3, we investigate learning aspects of the newly proposed methods in a context similar to some known SVM results. [sent-46, score-0.097]

26 1 From Perceptron to SVM We review the derivation of SVM from Perceptron, which serves as a reference for our derivation of regularized Winnow. [sent-49, score-0.381]

27 Consider linearly separable problems and let W be a weight that separates the in-class vectors from the out-of-class vectors in the training set. [sent-50, score-0.327]

28 It is well known that the Perceptron algorithm computes a weight that correctly classifies all training data after at most $M$ updates (a proof can be found in [15]), where $M = \|w\|_2^2 \max_i \|x^i\|_2^2 / (\min_i w^T x^i y^i)^2$. [sent-51, score-0.234]

29 The weight vector w* that minimizes the right hand side of the bound is called the optimal hyperplane in [15] or the optimal stability hyperplane in [1, 8, 11]. [sent-52, score-0.543]

30 This optimal hyperplane is the solution to the following quadratic programming problem: [sent-53, score-0.15]
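
The quadratic program itself did not survive extraction. A standard form of the optimal (maximal stability) hyperplane problem with a fixed threshold, given as a presumed reconstruction rather than a quote from the paper, is:

```latex
\min_{w}\ \tfrac{1}{2}\, w^T w
\qquad \text{subject to} \qquad
w^T x^i y^i \;\ge\; 1, \quad i = 1, \dots, n .
```

The soft-margin variant mentioned in sentences 18 and 31 adds slack variables $\xi_i \ge 0$, relaxes the constraints to $w^T x^i y^i \ge 1 - \xi_i$, and penalizes $C \sum_i \xi_i$ in the objective.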

31 We can write down the KKT condition for the above optimization problem, and let $\alpha_i$ be the Lagrangian multiplier for $w^T x^i y^i \ge 1 - \xi_i$. [sent-64, score-0.159]

32 We obtain the following dual optimization problem in the dual variable $\alpha$ (see [15], chapter 10 for details): $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\big(\sum_i \alpha_i x^i y^i\big)^2$ s.t. $0 \le \alpha_i \le C$. [sent-67, score-0.335]

33 To solve this problem, one can use the following modification of the Perceptron update algorithm (see [7] and chapter 12 of [13]): at each data point $(x^i, y^i)$, we fix all $\alpha_k$ with $k \ne$ [sent-76, score-0.212]

34 $i$, and update $\alpha_i$ to maximize the dual objective functional, which gives: $\alpha_i \leftarrow \max(\min(C, \alpha_i + \eta(1 - w^T x^i y^i)), 0)$, where $w = \sum_i \alpha_i x^i y^i$. [sent-77, score-0.277]

35 The learning rate $\eta$ can be set as $\eta = 1/(x^{iT} x^i)$, which corresponds to the exact maximization of the dual objective functional. [sent-78, score-0.203]
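
Sentences 33-35 describe a coordinate-wise ascent on the SVM dual. The Python sketch below shows one pass of that update with clipping to $[0, C]$; the step size $\eta = 1/(x^{iT} x^i)$ is taken from sentences 35 and 91, while the function name and the incremental maintenance of $w$ are implementation choices, not details from the paper.

```python
import numpy as np

def svm_dual_coordinate_pass(X, y, alpha, C):
    """One sweep of the Perceptron-like dual update for a fixed-threshold SVM.

    At each point i, all other alpha_k are held fixed and alpha_i is moved
    toward the maximizer of the dual objective, then clipped to [0, C].
    X has shape (n, d), y has entries +/-1, alpha has shape (n,)."""
    w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
    for i in range(len(y)):
        eta = 1.0 / (X[i] @ X[i])                # "exact" step from sentence 35 (assumed form)
        new_ai = np.clip(alpha[i] + eta * (1.0 - y[i] * (w @ X[i])), 0.0, C)
        w += (new_ai - alpha[i]) * y[i] * X[i]   # keep w consistent with the updated alpha
        alpha[i] = new_ai
    return alpha, w
```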

36 The proof of this specific bound can be found in [16] which employed techniques in [5] (also see [10] for earlier results). [sent-81, score-0.102]

37 Note that unlike the Perceptron mistake bound, the above bound is learning rate dependent. [sent-82, score-0.24]

38 For problems separable with positive weights, to obtain an optimal stability hyperplane associated with the Winnow mistake bound, we consider fixing $\|w\|_1$ such that $\|w\|_1 = W > 0$. [sent-85, score-0.477]

39 It is then natural to define the optimal hyperplane as the (positive weight) solution to the following convex programming problem: [sent-86, score-0.15]
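
The convex program referred to in sentence 39 is also missing from the extraction. A plausible reconstruction, assuming the entropy regularizer implied by the KKT identity in sentence 74 and the dual solution $w^*(C)_j = \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$ in sentence 49, is (up to terms constant in $w$):

```latex
\min_{w \ge 0}\ \sum_j \Big( w_j \ln\frac{w_j}{\mu_j} - w_j \Big)
\qquad \text{subject to} \qquad
w^T x^i y^i \;\ge\; 1, \quad i = 1, \dots, n ,
```

with the additional constraint $\sum_j w_j = W$ giving the normalized variant of sentence 12.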

40 $i$ for each data point $(x^i, y^i)$, and compute a weight vector $w$. [sent-96, score-0.132]

41 This implies that the derived methods are in fact regularized versions of the normalized Winnow. [sent-104, score-0.474]

42 One can also ignore this normalization constraint so that the derived methods correspond to regularized versions of the unnormalized Winnow. [sent-105, score-0.553]

43 The entropy regularization condition is natural to all exponentiated gradient methods [9], as can be observed from the theoretical results in [9]. [sent-106, score-0.258]

44 The regularized normalized Winnow is closely related to the maximum entropy discrimination [6] (the two methods are almost identical for linearly separable problems). [sent-107, score-0.656]

45 As we shall show later, it is possible to derive interesting learning bounds for our methods that are connected with the Winnow mistake bound. [sent-109, score-0.332]

46 Similar to the SVM formulation, the non-separable formulation of regularized Winnow approaches the separable formulation as $C \to \infty$. [sent-110, score-0.531]

47 Also similar to an SVM, we can write down the KKT condition and let $\alpha_i$ be the Lagrangian multiplier for $w^T x^i y^i \ge 1 - \xi_i$. [sent-112, score-0.159]

48 After elimination of $w$ and $\xi$, we obtain (the algebra resembles that of [15], chapter 10, which we shall skip due to the limitation of space) the following dual formulation for regularized unnormalized Winnow: [sent-113, score-0.779]

49 The $j$-th component of the weight $w^*(C)$ is given by $w^*(C)_j = \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$ at the optimal solution. [sent-122, score-0.139]

50 For regularized normalized Winnow with $\|w\|_1 = W > 0$, we obtain: [sent-123, score-0.377]

51 The weight $w^*(C)$ is given by $w^*(C)_j = W \mu_j \exp(\sum_i \alpha_i x^i_j y^i) \big/ \sum_j \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$ at the optimal solution. [sent-130, score-0.139]

52 Similar to the Perceptron-like update rule for the dual SVM formulation, it is possible to derive Winnow-like update rules for the regularized Winnow formulations. [sent-131, score-0.757]

53 At each data point $(x^i, y^i)$, we fix all $\alpha_k$ with $k \ne i$, and update $\alpha_i$ to maximize the dual objective functionals. [sent-132, score-0.311]

54 We shall not try to derive an analytical solution, but rather use a gradient ascent method with a learning rate $\eta$: $\alpha_i \leftarrow \alpha_i + \eta \frac{\partial}{\partial \alpha_i} L_D(\alpha_i)$, where we use $L_D$ to denote the dual objective function to be maximized. [sent-133, score-0.385]

55 It is not hard to verify that we obtain the following update rule for regularized unnormalized Winnow: $\alpha_i \leftarrow \max(\min(C, \alpha_i + \eta(1 - w^T x^i y^i)), 0)$, where $w_j = \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$. [sent-135, score-0.6]

56 This gradient ascent on the dual variable gives an EGU rule as in [9]. [sent-136, score-0.255]

57 Compared with the SVM dual update rule, which is a soft-margin version of the Perceptron update rule, this method naturally corresponds to a soft-margin version of the unnormalized Winnow update. [sent-137, score-0.567]

58 Similarly, we obtain the following dual update rule for regularized normalized Winnow: $\alpha_i \leftarrow \max(\min(C, \alpha_i + \eta(1 - w^T x^i y^i)), 0)$, where $w_j = W \mu_j \exp(\sum_i \alpha_i x^i_j y^i) \big/ \sum_j \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$. [sent-138, score-0.665]
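
Sentences 55 and 58 give the dual updates for both regularized Winnow variants. Below is a compact Python sketch of one sweep over the data, assuming a user-supplied prior $\mu$, learning rate $\eta$, and (optionally) a normalization constant $W$; the function signature and the naive recomputation of $w$ at every step are illustrative simplifications, not details from the paper.

```python
import numpy as np

def regularized_winnow_dual_pass(X, y, alpha, C, mu, eta=0.1, W=None):
    """One sweep of the Winnow-like dual updates for regularized Winnow.

    alpha_i <- clip(alpha_i + eta * (1 - y_i * w.x_i), 0, C), where the primal
    weight is recovered multiplicatively from the dual variables:
      unnormalized:  w_j = mu_j * exp(sum_i alpha_i x_ij y_i)
      normalized:    w_j = W * mu_j * exp(...) / sum_j mu_j * exp(...)
    Pass W=None for the unnormalized variant."""
    for i in range(len(y)):
        u = mu * np.exp((alpha * y) @ X)     # mu_j * exp(sum_i alpha_i x_ij y_i)
        w = W * u / u.sum() if W is not None else u
        alpha[i] = np.clip(alpha[i] + eta * (1.0 - y[i] * (w @ X[i])), 0.0, C)
    return alpha, w
```

Maintaining the exponents $\sum_i \alpha_i x^i_j y^i$ incrementally would avoid recomputing them from scratch at every step; the sketch keeps the naive form for clarity.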

59 Again, this rule (which is an EG rule in [9]) can be naturally regarded as the soft-margin version of the normalized Winnow update. [sent-139, score-0.161]

60 In our experience, these update rules are numerically very efficient. [sent-140, score-0.1]

61 Note that for regularized normalized Winnow, the normalization constant W needs to be carefully chosen based on the data. [sent-141, score-0.377]

62 For example, if the data is infinity-norm bounded by 1, then it does not seem appropriate to choose $W \le 1$, since $|w^T x| \le 1$: a hyperplane with $\|w\|_1 \le 1$ does not achieve a reasonable margin. [sent-142, score-0.109]

63 This problem is less crucial for unnormalized Winnow, but the norm of the initial weight $\mu_j$ still affects the solution. [sent-143, score-0.223]

64 Besides maximum entropy discrimination which is closely related to regularized normalized Winnow, a large margin version of unnormalized Winnow has also been proposed based on some heuristics [3,4]. [sent-144, score-0.647]

65 However, their algorithm was purely mistake driven without dual variables $\alpha_i$ (the algorithm does not compute an optimal stability hyperplane for the Winnow mistake bound). [sent-145, score-0.602]

66 In addition, they did not include a regularization parameter C which in practice may be important for non-separable problems. [sent-146, score-0.084]

67 3 Some statistical properties of regularized Winnows In this section, we derive some learning bounds based on our formulations that minimize the Winnow mistake bound. [sent-147, score-0.575]

68 The following result is an analogue of a leave-one-out cross-validation bound for separable SVMs - Theorem 10. [sent-148, score-0.232]

69 1 The expected misclassification error $\mathrm{err}_n$ with respect to the true distribution by using the hyperplane $w$ obtained from the linearly separable ($C = \infty$) unnormalized regularized Winnow algorithm with $n$ training samples is bounded by $\mathrm{err}_n < \frac{1}{n+1}\, E \min(K, \dots)$. [sent-151, score-0.864]

70 Let $w$ be the optimal solution using all the samples, with dual $\alpha_i$ for $i = 1, \dots$ [sent-157, score-0.185]

71 Let $w^k$ be the weight obtained from setting $\alpha_k = 0$; then $W = \max(\|w\|_1, \|w^1\|_1, \dots$ [sent-161, score-0.098]

72 Denote by $\tilde{w}^k$ the weight obtained from the optimal solution by removing $(x^k, y^k)$ from the training sample. [sent-167, score-0.165]

73 7 in [15], we need to bound the leave-one-out cross-validation error, which is at most K. [sent-169, score-0.077]

74 For the dual formulation, by summing over index $k$ of the KKT first-order condition with respect to the dual $\alpha_k$, multiplied by $\alpha_k$, one obtains $\sum_k \alpha_k = \sum_j w_j \ln\frac{w_j}{\mu_j}$. [sent-175, score-0.312]

75 By using the same technique, we may also obtain a bound for regularized normalized Winnow. [sent-178, score-0.454]

76 One disadvantage of the above bound is that it is the expectation of a random estimator that is no better than the leave-one-out cross-validation error based on observed data. [sent-179, score-0.077]

77 However, the bound does convey some useful information: for example, we can observe that the expected misclassification error (learning curve) converges at a rate of $O(1/n)$ as long as $W (\sum_j w_j \ln\frac{w_j}{\mu_j})$ and $\sup \|x\|_\infty$ are reasonably bounded. [sent-180, score-0.111]

78 It is also not difficult to obtain interesting PAC style bounds by using the covering number result for entropy regularization in [16] and ideas in [14]. [sent-181, score-0.203]

79 Although the PAC analysis would imply a slightly suboptimal learning curve of $O(\log n / n)$ for linearly separable problems, the bound itself provides a probability confidence and can be generalized to non-separable problems. [sent-182, score-0.268]

80 The bound itself is a direct consequence of Theorem 2. [sent-184, score-0.077]

81 2 and a covering number result with entropy regularization in [16]. [sent-185, score-0.16]

82 2 If the data is infinity-norm bounded as $\|x\|_\infty \le b$, then consider the family $\Gamma$ of hyperplanes $w$ such that $\|w\|_1 \le a$ and $\sum_j w_j \ln\frac{w_j}{\mu_j} \le c$. [sent-188, score-0.07]

83 (a margin-based bound in terms of $a$, $b$, $c$, $n$, the margin $\gamma$, and the confidence parameter $\eta$), where $k_\gamma = |\{i : w^T x^i y^i < \gamma\}|$ is the number of samples with margin less than $\gamma$. 4 An example We use an artificial dataset to show that a regularized Winnow can enhance a Winnow just like an SVM can enhance a Perceptron. [sent-192, score-0.48]

84 In addition, it shows that for problems with many irrelevant features, the Winnow algorithms are superior to the Perceptron family algorithms. [sent-193, score-0.15]

85 The first 5 components of the target linear weight w are set to ones; the 6th component is -1; and the remaining components are zeros. [sent-196, score-0.098]

86 One thousand training and one thousand test data are generated. [sent-202, score-0.08]

87 We shall only consider balanced versions of the Winnows. [sent-203, score-0.119]

88 We use UWin and NWin to denote the basic unnormalized and normalized Winnows respectively. [sent-205, score-0.194]

89 For online algorithms, we fix the learning rates at 0. [sent-211, score-0.092]

90 For large margin Winnows, we use learning rates $\eta = 0.$ [sent-213, score-0.064]

91 For (2-norm regularized) large margin Perceptron, we use the exact update, which corresponds to the choice $\eta = 1/(x^{iT} x^i)$. [sent-215, score-0.138]

92 For regularization methods, accuracies are reported with the optimal regularization parameters. [sent-217, score-0.264]

93 The superiority of the regularized Winnows is obvious, especially for high dimensional data. [sent-218, score-0.331]

94 Accuracies of regularized algorithms with different regularization parameters are plotted in Figure 1. [sent-219, score-0.443]

95 In practice, the optimal regularization parameter can be found by cross-validation. [sent-221, score-0.125]

96 5 Conclusion In this paper, we derived regularized versions of Winnow online update algorithms. [sent-248, score-0.526]

97 We studied algorithmic and theoretical properties of the newly obtained algorithms, and compared them to the Perceptron family algorithms. [sent-249, score-0.08]

98 Experimental results indicated that for problems with many irrelevant features, the Winnow family algorithms are superior to Perceptron family algorithms. [sent-250, score-0.193]

99 This is consistent with the implications from both online learning theory and the learning bounds obtained in this paper. [sent-251, score-0.127]

100 Learning times of neural networks: Exact solution for a perceptron algorithm. [sent-315, score-0.232]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('winnow', 0.635), ('regularized', 0.331), ('wj', 0.323), ('perceptron', 0.232), ('dual', 0.144), ('mistake', 0.137), ('unnormalized', 0.125), ('separable', 0.118), ('xiyi', 0.111), ('ilwlll', 0.111), ('winnows', 0.111), ('hyperplane', 0.109), ('update', 0.1), ('weight', 0.098), ('svm', 0.098), ('yi', 0.09), ('regularization', 0.084), ('tj', 0.078), ('bound', 0.077), ('accuracies', 0.055), ('shall', 0.054), ('ix', 0.052), ('qi', 0.051), ('discriminant', 0.05), ('entropy', 0.049), ('max', 0.048), ('label', 0.048), ('kkt', 0.047), ('chapter', 0.047), ('linearly', 0.047), ('normalized', 0.046), ('ill', 0.045), ('rule', 0.044), ('bounds', 0.043), ('family', 0.043), ('optimal', 0.041), ('irrelevant', 0.041), ('formulation', 0.041), ('margin', 0.038), ('problems', 0.038), ('derive', 0.038), ('exponentiated', 0.037), ('newly', 0.037), ('cj', 0.037), ('ascent', 0.037), ('exp', 0.037), ('crossvalidation', 0.037), ('egu', 0.037), ('elimination', 0.037), ('errn', 0.037), ('iiwllt', 0.037), ('ilillk', 0.037), ('ilxill', 0.037), ('misclassifies', 0.037), ('qixiyi', 0.037), ('testset', 0.037), ('xit', 0.037), ('versions', 0.036), ('methods', 0.034), ('fix', 0.034), ('stability', 0.034), ('vector', 0.034), ('misclassification', 0.034), ('objective', 0.033), ('percentage', 0.032), ('updates', 0.032), ('online', 0.032), ('grove', 0.032), ('err', 0.032), ('ilxi', 0.032), ('ld', 0.032), ('multiplicative', 0.031), ('modification', 0.031), ('discrimination', 0.031), ('gradient', 0.03), ('classifies', 0.029), ('balanced', 0.029), ('xi', 0.028), ('algorithms', 0.028), ('threshold', 0.027), ('derived', 0.027), ('version', 0.027), ('hyperplanes', 0.027), ('thousand', 0.027), ('covering', 0.027), ('learning', 0.026), ('training', 0.026), ('additive', 0.025), ('derivation', 0.025), ('proof', 0.025), ('theorem', 0.025), ('condition', 0.024), ('correctly', 0.024), ('lagrangian', 0.024), ('tong', 0.024), ('multiplier', 0.024), ('ibm', 0.024), ('denote', 0.023), ('solves', 0.023), ('pac', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

2 0.23319277 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

3 0.20312481 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

4 0.14035615 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes $O\big((p-1)/(\alpha^2 \gamma^2)\big)$ corrections to separate the data with p-norm margin larger than $(1 - \alpha)\gamma$, where $\gamma$ is the p-norm margin of the data and $X$ is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

5 0.11708416 133 nips-2000-The Kernel Gibbs Sampler

Author: Thore Graepel, Ralf Herbrich

Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1

6 0.11503868 75 nips-2000-Large Scale Bayes Point Machines

7 0.10138841 18 nips-2000-Active Support Vector Machine Classification

8 0.084813029 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

9 0.079599686 44 nips-2000-Efficient Learning of Linear Perceptrons

10 0.078834735 54 nips-2000-Feature Selection for SVMs

11 0.075033695 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

12 0.068735704 21 nips-2000-Algorithmic Stability and Generalization Performance

13 0.064793557 74 nips-2000-Kernel Expansions with Unlabeled Examples

14 0.06144705 4 nips-2000-A Linear Programming Approach to Novelty Detection

15 0.059225451 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

16 0.058935091 13 nips-2000-A Tighter Bound for Graphical Models

17 0.057950702 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

18 0.05785729 110 nips-2000-Regularization with Dot-Product Kernels

19 0.056474876 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

20 0.050773989 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.196), (1, 0.206), (2, -0.109), (3, -0.018), (4, -0.049), (5, -0.168), (6, 0.02), (7, -0.012), (8, 0.018), (9, 0.012), (10, -0.043), (11, -0.048), (12, -0.028), (13, 0.0), (14, 0.061), (15, -0.113), (16, -0.168), (17, -0.047), (18, -0.038), (19, 0.026), (20, 0.148), (21, 0.078), (22, -0.077), (23, 0.203), (24, -0.078), (25, -0.089), (26, 0.083), (27, -0.178), (28, -0.09), (29, -0.088), (30, 0.04), (31, -0.088), (32, -0.025), (33, 0.141), (34, -0.153), (35, 0.064), (36, -0.086), (37, 0.056), (38, -0.152), (39, 0.096), (40, 0.028), (41, 0.002), (42, -0.013), (43, -0.038), (44, 0.004), (45, -0.056), (46, 0.15), (47, 0.084), (48, 0.053), (49, -0.1)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95317715 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

2 0.7987538 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

3 0.67086309 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes $O\big((p-1)/(\alpha^2 \gamma^2)\big)$ corrections to separate the data with p-norm margin larger than $(1 - \alpha)\gamma$, where $\gamma$ is the p-norm margin of the data and $X$ is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

4 0.51791769 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

5 0.4559924 44 nips-2000-Efficient Learning of Linear Perceptrons

Author: Shai Ben-David, Hans-Ulrich Simon

Abstract: We consider the existence of efficient algorithms for learning the class of half-spaces in $\mathbb{R}^n$ in the agnostic learning model (i.e., making no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best agreement half-space over an input sample - is NP-hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is µ-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the µ-margins of its separating hyper-plane are disregarded. We prove crisp computational complexity results with respect to this success measure: On one hand, for every positive µ, there exist efficient (poly-time) µ-margin successful learning algorithms. On the other hand, we prove that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in 1/µ that is µ-margin successful for all µ > 0. 1

6 0.44587696 18 nips-2000-Active Support Vector Machine Classification

7 0.3585186 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.32335368 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

9 0.30845657 22 nips-2000-Algorithms for Non-negative Matrix Factorization

10 0.28386933 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

11 0.28109357 54 nips-2000-Feature Selection for SVMs

12 0.26150402 75 nips-2000-Large Scale Bayes Point Machines

13 0.256515 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

14 0.23784557 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

15 0.21829353 133 nips-2000-The Kernel Gibbs Sampler

16 0.21219654 60 nips-2000-Gaussianization

17 0.20545393 110 nips-2000-Regularization with Dot-Product Kernels

18 0.20511365 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

19 0.2022032 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

20 0.18680418 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.039), (17, 0.112), (32, 0.019), (33, 0.079), (34, 0.233), (48, 0.012), (49, 0.015), (54, 0.011), (55, 0.02), (62, 0.045), (65, 0.029), (67, 0.072), (76, 0.04), (77, 0.04), (81, 0.013), (90, 0.093), (97, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93274206 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration

Author: K. Y. Michael Wong, Hidetoshi Nishimori

Abstract: We introduce stagewise processing in error-correcting codes and image restoration, by extracting information from the former stage and using it selectively to improve the performance of the latter one. Both mean-field analysis using the cavity method and simulations show that it has the advantage of being robust against uncertainties in hyperparameter estimation. 1

same-paper 2 0.83581388 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

3 0.61041832 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

4 0.60554475 100 nips-2000-Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks

Author: Richard H. R. Hahnloser, H. Sebastian Seung

Abstract: Ascribing computational principles to neural feedback circuits is an important problem in theoretical neuroscience. We study symmetric threshold-linear networks and derive stability results that go beyond the insights that can be gained from Lyapunov theory or energy functions. By applying linear analysis to subnetworks composed of coactive neurons, we determine the stability of potential steady states. We find that stability depends on two types of eigenmodes. One type determines global stability and the other type determines whether or not multistability is possible. We can prove the equivalence of our stability criteria with criteria taken from quadratic programming. Also, we show that there are permitted sets of neurons that can be coactive at a steady state and forbidden sets that cannot. Permitted sets are clustered in the sense that subsets of permitted sets are permitted and supersets of forbidden sets are forbidden. By viewing permitted sets as memories stored in the synaptic connections, we can provide a formulation of longterm memory that is more general than the traditional perspective of fixed point attractor networks. A Lyapunov-function can be used to prove that a given set of differential equations is convergent. For example, if a neural network possesses a Lyapunov-function, then for almost any initial condition, the outputs of the neurons converge to a stable steady state. In the past, this stability-property was used to construct attractor networks that associatively recall memorized patterns. Lyapunov theory applies mainly to symmetric networks in which neurons have monotonic activation functions [1, 2]. Here we show that the restriction of activation functions to threshold-linear ones is not a mere limitation, but can yield new insights into the computational behavior of recurrent networks (for completeness, see also [3]). We present three main theorems about the neural responses to constant inputs. The first theorem provides necessary and sufficient conditions on the synaptic weight matrix for the existence of a globally asymptotically stable set of fixed points. These conditions can be expressed in terms of copositivity, a concept from quadratic programming and linear complementarity theory. Alternatively, they can be expressed in terms of certain eigenvalues and eigenvectors of submatrices of the synaptic weight matrix, making a connection to linear systems theory. The theorem guarantees that the network will produce a steady state response to any constant input. We regard this response as the computational output of the network, and its characterization is the topic of the second and third theorems. In the second theorem, we introduce the idea of permitted and forbidden sets. Under certain conditions on the synaptic weight matrix, we show that there exist sets of neurons that are

5 0.60283202 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

6 0.59576857 81 nips-2000-Learning Winner-take-all Competition Between Groups of Neurons in Lateral Inhibitory Networks

7 0.59267598 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.58754164 21 nips-2000-Algorithmic Stability and Generalization Performance

9 0.58247113 52 nips-2000-Fast Training of Support Vector Classifiers

10 0.58090544 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

11 0.57635701 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

12 0.57536054 4 nips-2000-A Linear Programming Approach to Novelty Detection

13 0.56813556 75 nips-2000-Large Scale Bayes Point Machines

14 0.56800431 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

15 0.56351489 44 nips-2000-Efficient Learning of Linear Perceptrons

16 0.56332743 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

17 0.56108022 36 nips-2000-Constrained Independent Component Analysis

18 0.56064194 79 nips-2000-Learning Segmentation by Random Walks

19 0.56038553 133 nips-2000-The Kernel Gibbs Sampler

20 0.55797887 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition