nips nips2000 nips2000-21 knowledge-graph by maker-knowledge-mining

21 nips-2000-Algorithmic Stability and Generalization Performance


Source: pdf

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Algorithmic Stability and Generalization Performance Olivier Bousquet CMAP Ecole Polytechnique F-91128 Palaiseau cedex FRANCE bousquet@cmapx. [sent-1, score-0.152]

2 Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. [sent-3, score-1.104]

3 A stable learner is one for which the learned solution does not change much with small changes in the training set. [sent-4, score-0.14]

4 The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. [sent-5, score-0.614]

5 VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. [sent-7, score-0.384]

6 We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. [sent-8, score-0.898]

7 1 Introduction A key issue in computational learning theory is to bound the generalization error of learning algorithms. [sent-9, score-0.468]

8 Until recently, most of the research in that area has focused on uniform a priori bounds, giving a guarantee that the difference between the training error and the test error is uniformly small for any hypothesis in a given class. [sent-10, score-0.851]

9 These bounds are usually expressed in terms of combinatorial quantities such as the VC dimension. [sent-11, score-0.397]

10 In the last few years, researchers have tried to use more refined quantities to either estimate the complexity of the search space (e.g. [sent-12, score-0.414]

11 covering numbers [1]) or to use a posteriori information about the solution found by the algorithm (e.g. [sent-14, score-0.285]

12 There exist other approaches such as observed VC dimension [12], but all are concerned with structural properties of the learning systems. [sent-17, score-0.236]

13 In this paper we present a novel way of obtaining PAC bounds for specific algorithms, explicitly using their stability properties. [sent-18, score-0.798]

14 This method has the nice advantage of providing bounds that do not depend on any complexity measure of the search space (footnote: This work was done while the author was at Laboratoire ERIC, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, F-69676 Bron cedex, FRANCE) (e.g. [sent-20, score-0.801]

15 VC-dimension or covering numbers) but rather on the way the algorithm searches this space. [sent-22, score-0.303]

16 In that respect, our approach can be related to Freund's [7] where the estimated size of the subset of the hypothesis space actually searched by the algorithm is used to bound its generalization error. [sent-23, score-0.569]

17 However, Freund's result depends on a complexity term which we do not have here, since we are not looking separately at the hypotheses considered by the algorithm and their risk. [sent-24, score-0.147]

18 The paper is structured as follows: the next section introduces the notations and the definition of stability used throughout the paper. [sent-25, score-0.52]

19 In Section 4 we will prove that regularization networks are stable and apply the main result to obtain bounds on their generalization ability. [sent-27, score-0.621]

20 2 Notations and Definitions. X and Y being respectively an input and an output space, we consider a learning set S = {z_1 = (x_1, y_1), ... [sent-29, score-0.042]

21 A learning algorithm is a function A from the set of learning sets into Y^X, mapping a learning set S onto a function f_S from X to Y. [sent-35, score-0.158]

22 It is also assumed that the algorithm A is symmetric with respect to S, i.e. [sent-37, score-0.12]

23 for any permutation of the elements of S, f_S yields the same result. [sent-39, score-0.052]

24 Furthermore, we assume that all functions are measurable and all sets are countable, which does not limit the interest of the results presented here. [sent-40, score-0.096]

25 The empirical error of a function f measured on the training set S is: R_m(f) = (1/m) sum_{i=1}^{m} c(f, z_i), with c : Y^X x X x Y -> R_+ being a cost function. [sent-41, score-0.456]

26 The risk or generalization error can be written as: R(f) = E_{z~D}[c(f, z)]. The study we describe here intends to bound the difference between empirical and generalization error for specific algorithms. [sent-42, score-0.968]
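
As a concrete illustration of the two quantities just defined, the short Python sketch below computes the empirical error R_m(f) of a fixed predictor on a training set and a Monte Carlo estimate of its risk R(f) under a squared-loss cost. The distribution, the predictor f, the sample size m, and every helper name are hypothetical and serve only to make the definitions explicit; this is a sketch, not material from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(f, z):
    """Squared-loss cost c(f, z) for an example z = (x, y)."""
    x, y = z
    return (f(x) - y) ** 2

def empirical_error(f, S):
    """R_m(f): average cost of f over the training set S."""
    return sum(cost(f, z) for z in S) / len(S)

def risk_estimate(f, sample_z, n=100_000):
    """Monte Carlo estimate of R(f) = E_{z~D}[c(f, z)] from fresh draws of D."""
    return sum(cost(f, sample_z()) for _ in range(n)) / n

def sample_z():
    """One draw z = (x, y) from a hypothetical distribution D: y = 2x + noise."""
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x + 0.1 * rng.normal()

f = lambda x: 1.8 * x                 # a fixed predictor f in Y^X
S = [sample_z() for _ in range(50)]   # training set of size m = 50

print("empirical error R_m(f):", empirical_error(f, S))
print("Monte Carlo risk  R(f):", risk_estimate(f, sample_z))
```

The gap between the two printed numbers is exactly the deviation |R_m(f) - R(f)| whose probability of exceeding ε the paper sets out to bound for the function actually returned by a learning algorithm.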

27 More precisely, our goal is to bound, for any ε > 0, the term (1). Usually, learning algorithms cannot output just any function in Y^X but rather pick a function f_S in a set F ⊆ Y^X representing the structure or the architecture or the model. [sent-43, score-0.205]

28 Classical VC theory deals with structural properties and aims at bounding the following quantity: P_{S~D^m}[ sup_{f ∈ F} |R_m(f) - R(f)| > ε ] (2). This applies to any algorithm using F as a hypothesis space, and a bound on this quantity directly implies a similar bound on (1). [sent-44, score-0.762]
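
To make the uniform quantity (2) concrete, the hedged sketch below estimates it by simulation for a small, invented finite class F of threshold classifiers under the 0-1 cost; the distribution, the class F, m, ε, and the number of trials are all assumptions made for illustration. For a specific algorithm, quantity (1) would instead track only the single function f_S returned on each draw of S, which is the object the stability argument targets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite hypothesis class F: threshold classifiers on [-1, 1].
F = [lambda x, t=t: (x > t).astype(float) for t in np.linspace(-1.0, 1.0, 11)]

def sample_S(m):
    """Draw m i.i.d. examples from a hypothetical distribution D."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = (x > 0.2).astype(float)        # noiseless labels from a fixed threshold
    return x, y

def err(f, x, y):
    """0-1 error of f on the sample (x, y)."""
    return float(np.mean(f(x) != y))

# Approximate the true risk R(f) of every f in F once, with a large fresh sample.
x_big, y_big = sample_S(100_000)
R = [err(f, x_big, y_big) for f in F]

m, eps, trials = 50, 0.1, 500
hits = 0
for _ in range(trials):
    x, y = sample_S(m)
    sup_dev = max(abs(err(f, x, y) - Rf) for f, Rf in zip(F, R))
    hits += sup_dev > eps

print("estimated P_{S~D^m}[ sup_{f in F} |R_m(f) - R(f)| > eps ] ~", hits / trials)
```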

29 However, classical bounds require the VC dimension of F to be finite and do not use information about algorithmic properties. [sent-45, score-0.555]

30 For a set F, there exist many ways to search it, which may yield different performance. [sent-46, score-0.073]

31 For instance, multilayer perceptrons can be learned by a simple backpropagation algorithm, alone or combined with a weight decay procedure. [sent-47, score-0.17]

32 In both cases the outcome of the algorithm belongs to the same set of functions, although the performance can be different. [sent-48, score-0.2]

33 VC theory was initially motivated by empirical risk minimization (ERM), in which case the uniform bounds on the quantity (2) give tight error bounds. [sent-49, score-0.979]

34 Intuitively, the empirical risk minimization principle relies on a uniform law of large numbers. [sent-50, score-0.508]

35 Because it is not known in advance what the minimum of the empirical risk will be, it is necessary to study the difference between empirical and generalization error for all possible functions in F. [sent-51, score-0.663]

36 If, now, we do not consider this minimum but instead focus on the outcome of a learning algorithm A, we may know a little more about what kind of functions will be obtained. [sent-52, score-0.288]

37 This limits the possibilities and restricts the supremum over all the functions in F to the possible outcomes of the algorithm. [sent-53, score-0.271]

38 An algorithm which always outputs the null function does not need to be studied by a uniform law of large numbers. [sent-54, score-0.303]

39 Let us introduce a notation for modified training sets: if S denotes the initial training set, S = {z_1, ..., z_m}, [sent-55, score-0.158]

40 then S^i denotes the training set after z_i has been replaced by a different training example z_i', that is S^i = {z_1, ..., z_{i-1}, z_i', z_{i+1}, ..., z_m}. [sent-61, score-0.158]

41 Now, we define a notion of stability for regression. [sent-68, score-0.339]

42 Let S = {z_1, ..., z_m} be a training set, S^i = S \ z_i be the training set where instance i has been removed, and A a symmetric algorithm. [sent-72, score-0.248]
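
The abstract's claim that regularization networks are stable can be probed empirically. Under the assumption that ridge regression stands in for a simple regularization network with squared loss, the sketch below retrains on every leave-one-out set S^i and records the largest observed change in loss over a set of probe points, a crude empirical proxy for a uniform stability constant. The data, the regularization parameter lam, and all helper names are invented; this illustrates the definition above, not the paper's bound.

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit(X, y, lam):
    """Minimize (1/m) * sum_i (x_i . w - y_i)^2 + lam * ||w||^2 in closed form."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

def sq_loss(w, X, y):
    """Squared loss of the linear predictor w at the points (X, y)."""
    return (X @ w - y) ** 2

# Hypothetical training set S of size m in d dimensions.
m, d, lam = 60, 5, 1.0
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = X @ w_true + 0.1 * rng.normal(size=m)

# Probe points z at which the losses of f_S and f_{S^i} are compared.
X_probe = rng.normal(size=(500, d))
y_probe = X_probe @ w_true + 0.1 * rng.normal(size=500)

w_S = ridge_fit(X, y, lam)
beta = 0.0
for i in range(m):
    X_i = np.delete(X, i, axis=0)      # S^i: training set with instance i removed
    y_i = np.delete(y, i)
    w_Si = ridge_fit(X_i, y_i, lam)
    diff = np.abs(sq_loss(w_S, X_probe, y_probe) - sq_loss(w_Si, X_probe, y_probe))
    beta = max(beta, float(diff.max()))

print("empirical proxy for the uniform stability constant beta:", beta)
```

Rerunning with a larger lam should shrink the measured proxy, which is consistent with the role regularization plays in the stability argument.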


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bounds', 0.327), ('vc', 0.293), ('stability', 0.279), ('notations', 0.197), ('zl', 0.171), ('zi', 0.164), ('generalization', 0.163), ('cedex', 0.152), ('zm', 0.147), ('risk', 0.144), ('andre', 0.131), ('avenue', 0.131), ('bousquet', 0.131), ('empirical', 0.129), ('bound', 0.12), ('searches', 0.119), ('france', 0.11), ('refined', 0.11), ('covering', 0.11), ('hypothesis', 0.108), ('error', 0.101), ('quantity', 0.089), ('obtaining', 0.089), ('law', 0.089), ('uniform', 0.088), ('dimension', 0.088), ('si', 0.087), ('freund', 0.082), ('algorithmic', 0.082), ('outcome', 0.082), ('training', 0.079), ('algorithm', 0.074), ('search', 0.073), ('complexity', 0.073), ('regularization', 0.07), ('quantities', 0.07), ('bron', 0.066), ('dm', 0.066), ('eric', 0.066), ('erm', 0.066), ('supremum', 0.066), ('structural', 0.063), ('stable', 0.061), ('depend', 0.061), ('notion', 0.06), ('universite', 0.059), ('barnhill', 0.059), ('irm', 0.059), ('lyon', 0.059), ('possess', 0.059), ('restricts', 0.059), ('savannah', 0.059), ('searched', 0.059), ('sis', 0.059), ('wagner', 0.059), ('classical', 0.058), ('numbers', 0.058), ('minimization', 0.058), ('olivier', 0.055), ('ron', 0.055), ('possibilities', 0.055), ('novel', 0.052), ('advance', 0.052), ('aims', 0.052), ('ga', 0.052), ('null', 0.052), ('permutation', 0.052), ('explicitly', 0.051), ('backpropagation', 0.049), ('rm', 0.049), ('ej', 0.049), ('fr', 0.049), ('ez', 0.049), ('measurable', 0.049), ('sup', 0.049), ('difference', 0.047), ('functions', 0.047), ('minimum', 0.047), ('yd', 0.047), ('deals', 0.047), ('pac', 0.047), ('nice', 0.047), ('multilayer', 0.047), ('symmetric', 0.046), ('space', 0.045), ('introduces', 0.044), ('outcomes', 0.044), ('author', 0.044), ('bounding', 0.044), ('belongs', 0.044), ('technologies', 0.044), ('ym', 0.044), ('instance', 0.044), ('kind', 0.043), ('tried', 0.043), ('tight', 0.043), ('concerned', 0.043), ('pick', 0.043), ('posteriori', 0.043), ('learning', 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

2 0.18447265 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.17415407 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

4 0.17329074 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes. Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {-1, 1} is a label. Let G be a set of functions from S into R. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ..., (X_n, Y_n) that consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into R. Let {X_i} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let P_n be the empirical measure based on the sample (X_1, ...

5 0.16755049 94 nips-2000-On Reversing Jensen's Inequality

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1

6 0.1457486 58 nips-2000-From Margin to Sparsity

7 0.13483235 37 nips-2000-Convergence of Large Margin Separable Linear Classification

8 0.11669762 54 nips-2000-Feature Selection for SVMs

9 0.11096144 13 nips-2000-A Tighter Bound for Graphical Models

10 0.096876532 120 nips-2000-Sparse Greedy Gaussian Process Regression

11 0.093414962 144 nips-2000-Vicinal Risk Minimization

12 0.090438306 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

13 0.083996862 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

14 0.077469796 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

15 0.072780736 44 nips-2000-Efficient Learning of Linear Perceptrons

16 0.068735704 111 nips-2000-Regularized Winnow Methods

17 0.06844344 75 nips-2000-Large Scale Bayes Point Machines

18 0.064639747 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

19 0.063812628 18 nips-2000-Active Support Vector Machine Classification

20 0.06287881 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.23), (1, 0.184), (2, -0.074), (3, -0.055), (4, 0.044), (5, -0.243), (6, 0.074), (7, -0.066), (8, -0.004), (9, -0.031), (10, -0.183), (11, 0.07), (12, 0.085), (13, 0.027), (14, 0.023), (15, 0.028), (16, 0.227), (17, 0.088), (18, -0.084), (19, 0.054), (20, 0.001), (21, -0.052), (22, 0.062), (23, -0.091), (24, 0.024), (25, 0.035), (26, 0.18), (27, -0.08), (28, -0.061), (29, 0.035), (30, 0.051), (31, -0.013), (32, 0.047), (33, -0.157), (34, 0.194), (35, 0.052), (36, -0.082), (37, 0.025), (38, 0.078), (39, -0.061), (40, 0.034), (41, -0.059), (42, 0.033), (43, -0.06), (44, -0.134), (45, 0.051), (46, -0.055), (47, 0.048), (48, -0.135), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96579403 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

2 0.76610184 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of h-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes. Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {-1, 1} is a label. Let G be a set of functions from S into R. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ..., (X_n, Y_n) that consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into R. Let {X_i} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let P_n be the empirical measure based on the sample (X_1, ...

3 0.58753473 94 nips-2000-On Reversing Jensen's Inequality

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1

4 0.52532905 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

5 0.47927991 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

6 0.47706899 37 nips-2000-Convergence of Large Margin Separable Linear Classification

7 0.4469409 54 nips-2000-Feature Selection for SVMs

8 0.44548634 120 nips-2000-Sparse Greedy Gaussian Process Regression

9 0.41688251 58 nips-2000-From Margin to Sparsity

10 0.33273873 13 nips-2000-A Tighter Bound for Graphical Models

11 0.31961975 125 nips-2000-Stability and Noise in Biochemical Switches

12 0.3151364 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

13 0.30410537 18 nips-2000-Active Support Vector Machine Classification

14 0.30402842 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

15 0.29217756 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

16 0.28517273 144 nips-2000-Vicinal Risk Minimization

17 0.28330418 25 nips-2000-Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators

18 0.27534911 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

19 0.27498734 44 nips-2000-Efficient Learning of Linear Perceptrons

20 0.25922766 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.062), (17, 0.144), (32, 0.012), (33, 0.078), (55, 0.027), (62, 0.031), (67, 0.077), (76, 0.052), (79, 0.019), (90, 0.063), (96, 0.31), (97, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80581176 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

2 0.54221892 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.52952749 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

4 0.52454746 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.

5 0.52369648 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

6 0.52345467 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

7 0.52135122 4 nips-2000-A Linear Programming Approach to Novelty Detection

8 0.51784021 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

9 0.5167039 133 nips-2000-The Kernel Gibbs Sampler

10 0.51466155 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

11 0.50907701 130 nips-2000-Text Classification using String Kernels

12 0.50755888 75 nips-2000-Large Scale Bayes Point Machines

13 0.50537086 94 nips-2000-On Reversing Jensen's Inequality

14 0.50522703 79 nips-2000-Learning Segmentation by Random Walks

15 0.50355411 52 nips-2000-Fast Training of Support Vector Classifiers

16 0.50347668 122 nips-2000-Sparse Representation for Gaussian Process Models

17 0.50077909 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

18 0.50011128 60 nips-2000-Gaussianization

19 0.49939898 36 nips-2000-Constrained Independent Component Analysis

20 0.49685374 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling