nips nips2000 nips2000-37 knowledge-graph by maker-knowledge-mining

37 nips-2000-Convergence of Large Margin Separable Linear Classification


Source: pdf

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Large margin linear classification methods have been successfully applied to many applications. [sent-4, score-0.347]

2 For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed "optimal hyperplane" approaches zero at a rate proportional to the inverse training sample size. [sent-5, score-0.996]

3 This rate is usually characterized by the margin and the maximum norm of the input data. [sent-6, score-0.31]

4 In this paper, we argue that another quantity, namely the robustness of the input data distribution, also plays an important role in characterizing the convergence behavior of expected misclassification error. [sent-7, score-0.573]

5 Based on this concept of robustness, we show that for a large margin separable linear classification problem, the expected misclassification error may converge to zero exponentially in the training sample size. [sent-8, score-1.221]

6 1 Introduction We consider the binary classification problem: to determine a label $y \in \{-1, 1\}$ associated with an input vector $x$. [sent-9, score-0.154]

7 Specifically, we seek a weight vector $w$ and a threshold $\theta$ such that $w^T x < \theta$ if the label is $y = -1$ and $w^T x \ge \theta$ if the label is $y = 1$. [sent-11, score-0.176]

8 In this paper, we are mainly interested in problems that are linearly separable by a positive margin (although, as we shall see later, our analysis is suitable for non-separable problems). [sent-12, score-0.952]

9 That is, there exists a hyperplane that perfectly separates the in-class data from the out-of-class data. [sent-13, score-0.304]

10 We shall also assume $\theta = 0$ throughout the rest of the paper for simplicity. [sent-14, score-0.098]

11 This restriction usually does not cause problems in practice since one can always append a constant feature to the input data $x$, which offsets the effect of $\theta$. [sent-15, score-0.063]
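As a quick illustration of this remark, here is a minimal sketch (not from the paper) of appending a constant feature so that a homogeneous classifier absorbs the threshold $\theta$; the arrays and values are made up for the example.

```python
import numpy as np

def augment(X):
    """Append a constant feature of 1 to each row so that a homogeneous
    classifier sign(w' . x') can absorb the threshold theta."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# With w' = [w, -theta] on the augmented data, the decision value equals
# w . x - theta on the original data, so the threshold is absorbed.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
w, theta = np.array([0.3, -0.7]), 0.2
assert np.allclose(augment(X) @ np.append(w, -theta), X @ w - theta)
```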

12 For linearly separable problems, given a training set of $n$ labeled data $(x^1, y^1), \ldots$ [sent-16, score-0.478]

13 $\ldots, (x^n, y^n)$, Vapnik recently proposed a method that optimizes a hard margin bound, which he calls the "optimal hyperplane" method (see [11]). [sent-19, score-0.363]

14 The optimal hyperplane $w_n$ is the solution to the following quadratic programming problem: [sent-20, score-0.246]

15 $\min_w \frac{1}{2} w^T w$ subject to $w^T x^i y^i \ge 1$ for $i = 1, \ldots, n$. (1) For linearly non-separable problems, a generalization of the optimal hyperplane method has appeared in [2], where a slack variable $\xi_i$ is introduced. [sent-21, score-0.467]
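For concreteness, a minimal sketch (not from the paper) of solving the hard-margin quadratic program (1) with a generic constrained optimizer; the SLSQP choice, the toy data, and the zero threshold are illustrative assumptions, and in practice (1) would be solved with a dedicated QP or SVM solver.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_hyperplane(X, y):
    """Hard-margin 'optimal hyperplane' (1): minimize 0.5 * ||w||^2
    subject to y_i * (w . x_i) >= 1, with the threshold fixed at zero."""
    n, d = X.shape
    cons = [{"type": "ineq", "fun": lambda w, i=i: y[i] * (X[i] @ w) - 1.0}
            for i in range(n)]
    res = minimize(lambda w: 0.5 * (w @ w), x0=np.zeros(d), jac=lambda w: w,
                   constraints=cons, method="SLSQP")
    return res.x

# Toy separable data: the solution should satisfy all margin constraints.
X = np.array([[2.0, 0.0], [-1.0, -1.0]])
y = np.array([1, -1])
w = optimal_hyperplane(X, y)
print(w, y * (X @ w))  # margins are >= 1 up to solver tolerance
```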

16 We compute a hyperplane $w_n$ that solves $\min_w \frac{1}{2} w^T w + C \sum_i \xi_i$ [sent-26, score-0.213]

17 where $C > 0$ is a given parameter (also see [11]). [sent-31, score-0.069]

18 for $i = 1, \ldots, n$. (2) In this paper, we are interested in the quality of the computed weight $w_n$ for the purpose of predicting the label $y$ of an unseen data point $x$. [sent-36, score-0.154]

19 We study this predictive power of $w_n$ in the standard batch learning framework. [sent-37, score-0.132]

20 The predictive power of the computed parameter $w_n$ then corresponds to the classification performance of $w_n$ with respect to the true distribution $D$. [sent-42, score-0.264]

21 In Section 2, we briefly review a number of existing techniques for analyzing separable linear classification problems. [sent-44, score-0.476]

22 We then derive an exponential convergence rate of the misclassification error in Section 3 for certain large margin linear classification problems. [sent-45, score-0.924]

23 Section 4 compares the newly derived bound with known results from the traditional margin analysis. [sent-46, score-0.406]

24 We explain that the exponential bound relies on a new quantity (the robustness of the distribution) which is not explored in a traditional margin bound. [sent-47, score-0.708]

25 Note that for certain batch learning problems, exponential learning curves have already been observed [10]. [sent-48, score-0.228]

26 It is thus not surprising that an exponential rate of convergence can be achieved by large margin linear classification. [sent-49, score-0.618]

27 2 Some known results on generalization analysis There are a number of ways to obtain bounds on the generalization error of a linear classifier. [sent-50, score-0.437]

28 Many such results that are related to large margin classification have been described in chapter 4 of [3]. [sent-52, score-0.397]

29 The analysis does not require the estimated parameter to converge to the true parameter, which is ideal for combinatorial problems. [sent-54, score-0.195]

30 However, for problems that are numerical in nature, the potential parameter space can be significantly reduced by using the first order condition of the optimal solution. [sent-55, score-0.195]

31 Generally speaking, for a problem that is linearly separable with a large margin, the expected classification error of the computed hyperplane resulting from this analysis is of the order $O\!\left(\frac{\log n}{n}\right)$. Similar generalization bounds can also be obtained for non-separable problems. [sent-57, score-1.178]

32 In chapter 10 of [11], Vapnik described a leave-one-out cross-validation analysis for linearly separable problems. [sent-58, score-0.58]

33 This analysis takes into account the first order KKT condition of the optimal hyperplane $w_n$. [sent-59, score-0.324]

34 The expected generalization performance from this analysis is $O(\frac{1}{n})$, which is better than the corresponding bounds from the VC analysis. [sent-60, score-0.29]

35 Unfortunately, this technique is only suitable for deriving an expected generalization bound (for example, it is not useful for obtaining a PAC-style probability bound). [sent-61, score-0.389]

36 Another well-known technique for analyzing linearly separable problems is the mistake bound framework in online learning. [sent-62, score-0.792]

37 The technique may lead to a bound with an expected generalization performance of $O(\frac{1}{n})$. [sent-66, score-0.314]

38 Besides the above mentioned approaches, generalization ability can also be studied in the statistical mechanical learning framework. [sent-67, score-0.141]

39 It was shown that for linearly separable problems, exponential decrease of misclassification error is possible under this framework [1, 5, 7, 8]. [sent-68, score-0.944]

40 Unfortunately, it is unclear how to relate the statistical mechanical learning framework to the batch learning framework considered in this paper. [sent-69, score-0.224]

41 Their analysis, employing approximation techniques, does not seem to imply the small-sample bounds in which we are interested. [sent-70, score-0.203]

42 The statistical mechanical learning result suggests that it may be possible to obtain a similar exponential decay of misclassification error in the batch learning setting, which we prove in the next section. [sent-71, score-0.688]

43 Furthermore, we show that the exponential rate depends on a quantity that is different from the traditional margin concept. [sent-72, score-0.545]

44 Our analysis relies on a PAC-style probability estimate of the convergence rate of the estimated parameter from (2) to the true parameter. [sent-73, score-0.418]

45 A direct analysis of the convergence rate of the estimated parameter to the true parameter is important for problems that are numerical in nature, such as (2). [sent-75, score-0.506]

46 However, a disadvantage of our analysis is that we are unable to directly deal with the linearly separable formulation (1). [sent-76, score-0.526]

47 3 Exponential convergence We can rewrite the SVM formulation (2) by eliminating $\xi_i$ as: $w_n(\lambda) = \operatorname{argmin}_w \frac{1}{n}\sum_{i=1}^{n} f(w^T x^i y^i - 1) + \frac{\lambda}{2} w^T w$, (3) where $\lambda = 1/(nC)$ and $f(z) = -z$ for $z \le 0$, $f(z) = 0$ for $z > 0$. [sent-77, score-0.114]
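A minimal sketch (not from the paper) of minimizing the regularized hinge objective in the reconstructed form of (3) by plain subgradient descent; the step-size schedule, iteration budget, and toy data are assumptions, and in practice an off-the-shelf SVM solver would be used.

```python
import numpy as np

def svm_subgradient(X, y, lam, iters=2000, lr=0.1):
    """Minimize (1/n) * sum_i max(0, 1 - y_i * w.x_i) + (lam/2) * ||w||^2,
    i.e. (1/n) * sum_i f(w.x_i * y_i - 1) + (lam/2) * w.w with the hinge-type f,
    by subgradient descent with a decaying step size; lam plays the role of 1/(n*C)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        active = y * (X @ w) < 1.0                   # points with nonzero loss
        sub = -(y[active, None] * X[active]).sum(axis=0) / n
        w -= (lr / np.sqrt(t)) * (sub + lam * w)     # subgradient step
    return w

# Toy usage on (essentially) linearly separable data.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
X = rng.normal(scale=0.5, size=(200, 2))
X[:, 0] += 2.0 * y
w_n = svm_subgradient(X, y, lam=0.01)
print((np.sign(X @ w_n) == y).mean())   # training accuracy, close to 1
```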

48 Let $w_*(\lambda)$ be the optimal solution with respect to the true distribution, i.e. the minimizer of $E_D f(w^T xy - 1) + \frac{\lambda}{2} w^T w$. [sent-79, score-0.098]

49 $E_D f(w^T xy - 1) = 0$, (5) which is the infinite-sample version of the optimal hyperplane method. [sent-85, score-0.554]

50 The latter condition ensures that $E_D f(w^T xy - 1) \le \|w\|_2 E_D \|x\|_2 + 1$ exists for all $w$. [sent-88, score-0.391]
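To see what the condition $E_D f(w^T xy - 1) = 0$ in (5) requires, here is a small Monte Carlo sketch assuming the hinge-type $f$ reconstructed above: the empirical analogue vanishes when $w$ attains margin at least 1 on (essentially) all of the data, and becomes positive once the margin drops below 1; the toy distribution is an assumption.

```python
import numpy as np

def f(z):
    """Hinge-type loss from (3): f(z) = -z for z <= 0 and 0 for z > 0,
    so f(w.x * y - 1) = max(0, 1 - y * w.x)."""
    return np.maximum(0.0, -z)

rng = np.random.default_rng(2)
y = rng.choice([-1, 1], size=5000)
X = rng.normal(scale=0.2, size=(5000, 2))
X[:, 0] += 2.0 * y                     # class means at +-2 along the first axis
w = np.array([1.0, 0.0])               # separates this sample with margin about 2

print(f(y * (X @ w) - 1.0).mean())          # ~ 0: margin >= 1 almost everywhere
print(f(y * (X @ (0.1 * w)) - 1.0).mean())  # > 0: rescaled w has margin below 1
```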

51 3.1 Continuity of solution under regularization In this section, we show that $\|w_* \ldots$ [sent-90, score-0.078]

52 us to approximate (5) by using (4) and (3) with a small positive regularization parameter $\lambda$. [sent-93, score-0.176]

53 We only need to show that within any sequence of $\lambda$ that converges to zero, there exists a subsequence $\lambda_i \to 0$ such that $w_*(\lambda_i) \ldots$ [sent-94, score-0.196]

54 It is well-known that every bounded sequence in a Hilbert space contains a weakly convergent subsequence (cf. [sent-104, score-0.066]

55 Therefore, within any sequence of $\lambda$ that converges to zero, there exists a subsequence $\lambda_i \to 0$ such that $w_*(\lambda_i) \ldots$ [sent-107, score-0.196]

56 $\|w\|_2 \|x\|_2 + 1$, which has a finite integral with respect to $D$; therefore, from (6) and the Lebesgue dominated convergence theorem, we obtain $0 = \lim E_D f(w_* \ldots$ [sent-112, score-0.291]

57 3.2 Accuracy of estimated hyperplane with non-zero regularization parameter Our goal is to show that for the estimation method (3) with a nonzero regularization parameter $\lambda > 0$, the estimated parameter $w_n(\lambda)$ converges to the true parameter $w_*(\lambda)$. [sent-136, score-0.842]

58 Furthermore, we give a large deviation bound on the rate of convergence. [sent-138, score-0.215]

59 $\ldots(\lambda)^T xy - 1)$, and $f'(z) \in [-1, 0]$ denotes a member of the subgradient of $f$ at $z$ [9]. [sent-141, score-0.364]

60 In the finite sample case, we can also interpret $\beta(\lambda, x, y)$ in (8) as a scaled dual variable $\alpha$: $\beta = -\alpha/C$, where $\alpha$ appears in the dual (or kernel) formulation of an SVM (for example, see chapter 10 of [11]). [sent-142, score-0.089]
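A small sketch (not from the paper) of these quantities: one valid choice of the subgradient values $\beta$, which always lie in $[-1, 0]$, together with a comment on the first-order condition that makes $\beta$ behave like a scaled dual variable; the toy vectors and the particular subgradient selection at the kink are assumptions.

```python
import numpy as np

def beta_values(w, X, y):
    """One valid member of the subgradient f'(w.x_i * y_i - 1) for the
    hinge-type f: -1 where the margin y_i * w.x_i is below 1, 0 otherwise,
    so every entry lies in [-1, 0]."""
    return np.where(y * (X @ w) < 1.0, -1.0, 0.0)

# Hypothetical w and data, purely for illustration.
X = np.array([[2.0, 0.5], [-1.5, 1.0], [0.2, -0.3]])
y = np.array([1, -1, 1])
w = np.array([0.8, -0.1])
print(beta_values(w, X, y))   # entries are -1 or 0, hence in [-1, 0]

# At an exact minimizer w_n(lam) of (3), the first-order condition
#   (1/n) * sum_i beta_i * y_i * x_i + lam * w_n(lam) = 0
# holds for some valid choice of the beta_i, which is the sense in which
# beta acts like a scaled dual variable (beta = -alpha/C with lam = 1/(n*C)).
```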

61 This implies the following inequality: $f(z_1) + f'(z_1)(z_2 - z_1) \le f(z_2)$ for any subgradient $f'$ of $f$; applying it to $\frac{1}{n}\sum_i f(w_* \ldots$ [sent-144, score-0.092]

62 $\ldots(w_*(\lambda) - w_n(\lambda))^2 \ldots$ For readers not familiar with the subgradient concept in convex analysis, our analysis requires little modification if we replace $f$ with a smoother convex function such as $f^2$, which avoids the discontinuity in the first order derivative. [sent-158, score-0.251]

63 (9) Note that in (9), we have already bounded the convergence of $w_n(\lambda)$ to $w_*(\lambda)$ in terms of the convergence of the empirical expectation of the random vector $\beta(\lambda, x, y)\, xy$ to its mean. [sent-183, score-0.228]

64 In order to obtain a large deviation bound on the convergence rate, we need the following result, which can be found in [13], page 95: Theorem 3.1. [sent-184, score-0.301]

65 If there exists $M > 0$ such that for all natural numbers $l \ge 2$: $\sum_{i=1}^{n} E\|\xi_i\|_2^l \le \frac{l!}{2}\, b^2 M^{l-2} \ldots$ [sent-186, score-0.085]

66 Using the fact that $\beta(\lambda, x, y) \in [-1, 0]$, it is easy to verify the following corollary by using Theorem 3.1 [sent-191, score-0.121]

67 and (9), where we also bound the $l$-th moment of the right-hand side of (9) using the following form of Jensen's inequality: $|a + b|^l \le 2^{l-1}(|a|^l + |b|^l)$ for $l \ge 2$. [sent-192, score-0.118]
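A quick numerical spot-check (not from the paper) of this form of Jensen's inequality, which follows from convexity of $t \mapsto |t|^l$; the sampling ranges are arbitrary.

```python
import numpy as np

# Spot-check |a + b|^l <= 2^(l-1) * (|a|^l + |b|^l) on random inputs.
rng = np.random.default_rng(1)
for _ in range(10000):
    a, b = 10.0 * rng.normal(size=2)
    l = int(rng.integers(2, 8))
    lhs = abs(a + b) ** l
    rhs = 2 ** (l - 1) * (abs(a) ** l + abs(b) ** l)
    assert lhs <= rhs * (1 + 1e-12)
print("inequality held on all sampled cases")
```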

68 Let $P(\cdot)$ denote the probability with respect to the distribution $D$; then the following bound on the expected misclassification error of the computed hyperplane $w_n(\lambda)$ is a straightforward consequence of Corollary 3. [sent-203, score-0.778]

69 We now consider linearly separable classification problems where the solution $w_*$ of (5) is finite. [sent-208, score-0.607]

70 $w_n(\lambda)$ also separates the in-class data from the out-of-class data with a margin of at least $2(1 - M\|w \ldots$ [sent-214, score-0.283]

71 2, we obtain the following upper bound on the misclassification error if we compute a linear separator from (3) with a small non-zero regularization parameter $\lambda$: $E_X P(w_n(\lambda)^T xy \le 0) \le 2\exp\!\left(-\frac{n\lambda^2\gamma(\lambda)^2}{4M^4 + \lambda\gamma(\lambda)M^2}\right)$. [sent-222, score-0.863]

72 This indicates that the expected misclassification error of an appropriately computed hyperplane for a linearly separable problem decreases exponentially in $n$. [sent-223, score-1.267]

73 However, the rate of convergence depends on $\lambda\gamma(\lambda)/M^2$. [sent-224, score-0.179]

74 This quantity is different from the margin concept, which has been widely used in the literature to characterize the generalization behavior of a linear classification problem. [sent-225, score-0.579]

75 The new quantity measures the convergence rate of $w_*(\lambda) \ldots$ [sent-226, score-0.245]

76 The faster the convergence, the more "robust" the linear classification problem is, and hence the faster the exponential decay of misclassification error is. [sent-229, score-0.597]

77 As we shall see in the next section, this "robustness" is related to the degree of outliers in the problem. [sent-230, score-0.158]

78 4 Example We give an example to illustrate the "robustness" concept that characterizes the exponential decay of misclassification error. [sent-231, score-0.547]

79 It is known from Vapnik's cross-validation bound in [11] (Theorem 10.7 [sent-232, score-0.118]

80 ) that by using the large margin idea alone, one can derive an expected misclassification error bound that is of the order $O(1/n)$, where the constant is margin dependent. [sent-233, score-1.024]

81 We show that this bound is tight by using the following example. [sent-234, score-0.118]

82 Assume that with probability $1 - \gamma$, we observe a data point $x$ with label $y$ such that $xy = [1, 0]$; and with probability $\gamma$, we observe a data point $x$ with label $y$ such that $xy = [-1, 1]$. [sent-237, score-0.854]

83 This problem is obviously linearly separable with a large margin that is $\gamma$-independent. [sent-238, score-0.755]

84 Now, for $n$ random training data, with probability at most $\gamma^n + (1 - \gamma)^n$, we observe either $x^i y^i = [1, 0]$ for all $i = 1, \ldots$ [sent-239, score-0.274]

85 For all other cases, the computed optimal hyperplane $w_n = w_*$. [sent-246, score-0.28]

86 This means that the misclassification error is $\gamma(1 - \gamma)(\gamma^{n-1} + (1 - \gamma)^{n-1})$. [sent-247, score-0.306]

87 This error converges to zero exponentially as $n \to \infty$. [sent-248, score-0.142]

88 However, the convergence rate depends on the fraction of outliers in the distribution, characterized by $\gamma$. [sent-249, score-0.309]

89 In particular, for any $n$, if we let $\gamma = 1/n$, then we have an expected misclassification error that is at least $\frac{1}{n}(1 - 1/n)^n \approx 1/(en)$. [sent-250, score-0.384]
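A short sketch evaluating the error expression reconstructed above, $\gamma(1-\gamma)(\gamma^{n-1} + (1-\gamma)^{n-1})$: for a fixed outlier fraction it decays exponentially in $n$, while the adversarial choice $\gamma = 1/n$ gives only the roughly $1/(en)$ rate quoted in the text; the particular values of $\gamma$ and $n$ are arbitrary.

```python
import numpy as np

def expected_error(gamma, n):
    """Expected misclassification error in the two-point example: the sample
    misses one of the two directions with probability gamma^n + (1-gamma)^n."""
    return gamma * (1 - gamma) * (gamma ** (n - 1) + (1 - gamma) ** (n - 1))

# Fixed outlier fraction: exponential decay in n.
for n in (10, 20, 40, 80):
    print(n, expected_error(0.2, n))

# gamma = 1/n: only a linear 1/(e*n) rate, matching (1/n)*(1 - 1/n)^n ~ 1/(e*n).
for n in (10, 100, 1000):
    print(n, expected_error(1.0 / n, n), 1.0 / (np.e * n))
```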

90 □ The above tightness construction of the linear decay rate of the expected generalization error (using the margin concept alone) requires the scenario that a small fraction of the data (on the order of the inverse sample size) is very different from the rest of the data. [sent-251, score-0.864]

91 This small portion of data can be considered as outliers, which can be measured by the "robustness" of the distribution. [sent-252, score-0.059]

92 It can be seen that the optimal hyperplane in (1) is quite sensitive to even a single outlier. [sent-256, score-0.246]

93 However, the previous large margin learning bounds seemed to have dismissed this concern. [sent-258, score-0.354]

94 In the worst case, even if the problem is separable by a large margin, outliers can still slow down the exponential convergence rate. [sent-260, score-0.719]

95 5 Conclusion In this paper, we derived new generalization bounds for large margin linearly separable classification. [sent-261, score-0.919]

96 Even though we have only discussed the consequence of this analysis for separable problems, the technique can be easily applied to non-separable problems (see Corollary 3. [sent-262, score-0.83]

97 For large margin separable problems, we show that exponential decay of generalization error may be achieved with an appropriately chosen regularization parameter. [sent-264, score-1.072]

98 However, the bound depends on a quantity which characterizes the robustness of the distribution. [sent-265, score-0.332]

99 An important difference between the robustness concept and the margin concept is that outliers may not be observable with high probability from the data, while the margin generally will be. [sent-266, score-0.893]

100 This implies that without any prior knowledge, it could be difficult to directly apply our bound using only the observed data. [sent-267, score-0.154]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wn', 0.379), ('separable', 0.344), ('xy', 0.308), ('margin', 0.245), ('xiyi', 0.243), ('misclassification', 0.241), ('hyperplane', 0.213), ('edf', 0.141), ('linearly', 0.134), ('exponential', 0.126), ('corollary', 0.121), ('bound', 0.118), ('convergence', 0.114), ('robustness', 0.11), ('ai', 0.11), ('outliers', 0.103), ('batch', 0.102), ('label', 0.088), ('generalization', 0.087), ('liffii', 0.085), ('af', 0.082), ('concept', 0.079), ('expected', 0.078), ('regularization', 0.078), ('bounds', 0.077), ('converges', 0.077), ('parameter', 0.069), ('classification', 0.066), ('quantity', 0.066), ('subsequence', 0.066), ('rate', 0.065), ('error', 0.065), ('yi', 0.064), ('decay', 0.063), ('problems', 0.063), ('subgradient', 0.056), ('thenfor', 0.056), ('shall', 0.055), ('chapter', 0.054), ('vc', 0.054), ('mechanical', 0.054), ('exists', 0.053), ('aw', 0.052), ('zd', 0.049), ('analysis', 0.048), ('vapnik', 0.046), ('continuity', 0.044), ('readers', 0.044), ('style', 0.044), ('perceptron', 0.043), ('throughout', 0.043), ('traditional', 0.043), ('estimated', 0.042), ('ed', 0.041), ('john', 0.041), ('pd', 0.041), ('dominated', 0.041), ('convex', 0.04), ('mistake', 0.038), ('characterizes', 0.038), ('separates', 0.038), ('xi', 0.038), ('obtain', 0.037), ('tong', 0.036), ('lim', 0.036), ('ibm', 0.036), ('implies', 0.036), ('linear', 0.036), ('inequality', 0.036), ('true', 0.036), ('sample', 0.035), ('pac', 0.034), ('sons', 0.034), ('computed', 0.034), ('framework', 0.034), ('therefore', 0.034), ('optimal', 0.033), ('numbers', 0.032), ('interested', 0.032), ('large', 0.032), ('appropriately', 0.032), ('observe', 0.031), ('suitable', 0.031), ('technique', 0.031), ('portion', 0.03), ('predictive', 0.03), ('characterizing', 0.03), ('analyzing', 0.03), ('imply', 0.03), ('princeton', 0.03), ('theorem', 0.03), ('condition', 0.03), ('small', 0.029), ('hilbert', 0.029), ('pn', 0.029), ('respect', 0.029), ('section', 0.028), ('wt', 0.027), ('fraction', 0.027), ('alone', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

2 0.23319277 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

3 0.2182288 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

4 0.20437849 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes $O\!\left(\frac{p-1}{\alpha^2 \gamma^2}\right)$ corrections to separate the data with p-norm margin larger than $(1 - \alpha)\gamma$, where $\gamma$ is the p-norm margin of the data and $X$ is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform considerably better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is considerably faster and easier to implement than standard SVM training algorithms.

5 0.19187516 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

6 0.1805305 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

7 0.13483235 21 nips-2000-Algorithmic Stability and Generalization Performance

8 0.13123235 54 nips-2000-Feature Selection for SVMs

9 0.11464895 4 nips-2000-A Linear Programming Approach to Novelty Detection

10 0.10792703 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

11 0.10509069 75 nips-2000-Large Scale Bayes Point Machines

12 0.09852583 44 nips-2000-Efficient Learning of Linear Perceptrons

13 0.098466963 13 nips-2000-A Tighter Bound for Graphical Models

14 0.094795763 133 nips-2000-The Kernel Gibbs Sampler

15 0.086309493 94 nips-2000-On Reversing Jensen's Inequality

16 0.08626692 74 nips-2000-Kernel Expansions with Unlabeled Examples

17 0.083701074 18 nips-2000-Active Support Vector Machine Classification

18 0.080941312 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

19 0.078662336 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

20 0.07276649 49 nips-2000-Explaining Away in Weight Space


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.257), (1, 0.273), (2, -0.134), (3, -0.017), (4, -0.045), (5, -0.232), (6, 0.039), (7, -0.007), (8, 0.006), (9, 0.002), (10, -0.098), (11, 0.017), (12, -0.014), (13, 0.029), (14, 0.105), (15, -0.115), (16, -0.055), (17, 0.018), (18, -0.122), (19, 0.03), (20, 0.215), (21, 0.056), (22, -0.041), (23, 0.121), (24, -0.041), (25, -0.01), (26, 0.105), (27, -0.195), (28, -0.05), (29, -0.006), (30, 0.002), (31, -0.114), (32, -0.048), (33, 0.071), (34, -0.046), (35, 0.063), (36, -0.031), (37, 0.096), (38, -0.164), (39, 0.015), (40, 0.009), (41, -0.013), (42, -0.004), (43, -0.065), (44, 0.019), (45, -0.021), (46, 0.071), (47, -0.001), (48, 0.044), (49, -0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96002525 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

2 0.8749373 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

3 0.7871387 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes $O\!\left(\frac{p-1}{\alpha^2 \gamma^2}\right)$ corrections to separate the data with p-norm margin larger than $(1 - \alpha)\gamma$, where $\gamma$ is the p-norm margin of the data and $X$ is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform considerably better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is considerably faster and easier to implement than standard SVM training algorithms.

4 0.64247155 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

5 0.50029373 44 nips-2000-Efficient Learning of Linear Perceptrons

Author: Shai Ben-David, Hans-Ulrich Simon

Abstract: We consider the existence of efficient algorithms for learning the class of half-spaces in $\mathbb{R}^n$ in the agnostic learning model (i.e., making no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best agreement half-space over an input sample - is NP-hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is $\mu$-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the $\mu$-margins of its separating hyper-plane are disregarded. We prove crisp computational complexity results with respect to this success measure: On one hand, for every positive $\mu$, there exist efficient (poly-time) $\mu$-margin successful learning algorithms. On the other hand, we prove that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in $1/\mu$ that is $\mu$-margin successful for all $\mu > 0$. 1

6 0.4856759 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

7 0.45805192 54 nips-2000-Feature Selection for SVMs

8 0.45603514 21 nips-2000-Algorithmic Stability and Generalization Performance

9 0.45038587 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

10 0.42252091 18 nips-2000-Active Support Vector Machine Classification

11 0.39022094 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

12 0.35188493 74 nips-2000-Kernel Expansions with Unlabeled Examples

13 0.32544714 13 nips-2000-A Tighter Bound for Graphical Models

14 0.30419859 4 nips-2000-A Linear Programming Approach to Novelty Detection

15 0.30171734 94 nips-2000-On Reversing Jensen's Inequality

16 0.29835257 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

17 0.28943142 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

18 0.28846157 75 nips-2000-Large Scale Bayes Point Machines

19 0.25614557 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

20 0.25002274 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.038), (17, 0.139), (32, 0.032), (33, 0.103), (36, 0.014), (49, 0.011), (55, 0.016), (62, 0.03), (65, 0.021), (67, 0.08), (76, 0.073), (77, 0.251), (81, 0.015), (90, 0.053), (97, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89095438 100 nips-2000-Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks

Author: Richard H. R. Hahnloser, H. Sebastian Seung

Abstract: Ascribing computational principles to neural feedback circuits is an important problem in theoretical neuroscience. We study symmetric threshold-linear networks and derive stability results that go beyond the insights that can be gained from Lyapunov theory or energy functions. By applying linear analysis to subnetworks composed of coactive neurons, we determine the stability of potential steady states. We find that stability depends on two types of eigenmodes. One type determines global stability and the other type determines whether or not multistability is possible. We can prove the equivalence of our stability criteria with criteria taken from quadratic programming. Also, we show that there are permitted sets of neurons that can be coactive at a steady state and forbidden sets that cannot. Permitted sets are clustered in the sense that subsets of permitted sets are permitted and supersets of forbidden sets are forbidden. By viewing permitted sets as memories stored in the synaptic connections, we can provide a formulation of longterm memory that is more general than the traditional perspective of fixed point attractor networks. A Lyapunov-function can be used to prove that a given set of differential equations is convergent. For example, if a neural network possesses a Lyapunov-function, then for almost any initial condition, the outputs of the neurons converge to a stable steady state. In the past, this stability-property was used to construct attractor networks that associatively recall memorized patterns. Lyapunov theory applies mainly to symmetric networks in which neurons have monotonic activation functions [1, 2]. Here we show that the restriction of activation functions to threshold-linear ones is not a mere limitation, but can yield new insights into the computational behavior of recurrent networks (for completeness, see also [3]). We present three main theorems about the neural responses to constant inputs. The first theorem provides necessary and sufficient conditions on the synaptic weight matrix for the existence of a globally asymptotically stable set of fixed points. These conditions can be expressed in terms of copositivity, a concept from quadratic programming and linear complementarity theory. Alternatively, they can be expressed in terms of certain eigenvalues and eigenvectors of submatrices of the synaptic weight matrix, making a connection to linear systems theory. The theorem guarantees that the network will produce a steady state response to any constant input. We regard this response as the computational output of the network, and its characterization is the topic of the second and third theorems. In the second theorem, we introduce the idea of permitted and forbidden sets. Under certain conditions on the synaptic weight matrix, we show that there exist sets of neurons that are

2 0.83951288 105 nips-2000-Programmable Reinforcement Learning Agents

Author: David Andre, Stuart J. Russell

Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.

same-paper 3 0.83912075 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

4 0.69725758 81 nips-2000-Learning Winner-take-all Competition Between Groups of Neurons in Lateral Inhibitory Networks

Author: Xiaohui Xie, Richard H. R. Hahnloser, H. Sebastian Seung

Abstract: It has long been known that lateral inhibition in neural networks can lead to a winner-take-all competition, so that only a single neuron is active at a steady state. Here we show how to organize lateral inhibition so that groups of neurons compete to be active. Given a collection of potentially overlapping groups, the inhibitory connectivity is set by a formula that can be interpreted as arising from a simple learning rule. Our analysis demonstrates that such inhibition generally results in winner-take-all competition between the given groups, with the exception of some degenerate cases. In a broader context, the network serves as a particular illustration of the general distinction between permitted and forbidden sets, which was introduced recently. From this viewpoint, the computational function of our network is to store and retrieve memories as permitted sets of coactive neurons. In traditional winner-take-all networks, lateral inhibition is used to enforce a localized, or

5 0.64843816 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

6 0.61129612 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

7 0.60365266 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.60273528 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

9 0.60000402 21 nips-2000-Algorithmic Stability and Generalization Performance

10 0.59719038 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

11 0.58541381 4 nips-2000-A Linear Programming Approach to Novelty Detection

12 0.58464122 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

13 0.58058333 60 nips-2000-Gaussianization

14 0.58057749 75 nips-2000-Large Scale Bayes Point Machines

15 0.58028245 133 nips-2000-The Kernel Gibbs Sampler

16 0.57981849 122 nips-2000-Sparse Representation for Gaussian Process Models

17 0.57854217 36 nips-2000-Constrained Independent Component Analysis

18 0.577438 130 nips-2000-Text Classification using String Kernels

19 0.57715791 79 nips-2000-Learning Segmentation by Random Walks

20 0.57547474 94 nips-2000-On Reversing Jensen's Inequality