nips nips2000 nips2000-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Olivier Bousquet, André Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Algorithmic Stability and Generalization Performance Olivier Bousquet CMAP Ecole Polytechnique F-91128 Palaiseau cedex FRANCE bousquet@cmapx. [sent-1, score-0.152]
2 Abstract We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. [sent-3, score-1.104]
3 A stable learner is one for which the learned solution does not change much with small changes in the training set. [sent-4, score-0.14]
4 The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. [sent-5, score-0.614]
5 VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. [sent-7, score-0.384]
6 We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. [sent-8, score-0.898]
7 1 Introduction A key issue in computational learning theory is to bound the generalization error of learning algorithms. [sent-9, score-0.468]
8 Until recently, most of the research in that area has focused on uniform a priori bounds giving a guarantee that the difference between the training error and the test error is uniformly small for any hypothesis in a given class. [sent-10, score-0.851]
9 These bounds are usually expressed in terms of combinatorial quantities such as the VC-dimension. [sent-11, score-0.397]
10 In the last few years, researchers have tried to use more refined quantities to either estimate the complexity of the search space (e.g. [sent-12, score-0.414]
11 covering numbers [1]) or to use a posteriori information about the solution found by the algorithm. [sent-14, score-0.285]
12 There exist other approaches such as observed VC dimension [12], but all are concerned with structural properties of the learning systems. [sent-17, score-0.236]
13 In this paper we present a novel way of obtaining PAC bounds for specific algorithms explicitly using their stability properties. [sent-18, score-0.798]
14 This method has the nice advantage of providing bounds that do not depend on any complexity measure of the search space (e.g. [sent-20, score-0.801]
(Footnote: This work was done while the author was at Laboratoire ERIC, Universite Lumiere Lyon 2, 5 avenue Pierre Mendes-France, F-69676 Bron cedex, FRANCE.)
15 VC-dimension or covering numbers) but rather on the way the algorithm searches this space. [sent-22, score-0.303]
16 In that respect, our approach can be related to Freund's [7] where the estimated size of the subset of the hypothesis space actually searched by the algorithm is used to bound its generalization error. [sent-23, score-0.569]
17 However, Freund's result depends on a complexity term which we do not have here, since we are not looking separately at the hypotheses considered by the algorithm and their risk. [sent-24, score-0.147]
18 The paper is structured as follows: the next section introduces the notations and the definition of stability used throughout the paper. [sent-25, score-0.52]
19 In Section 4 we will prove that regularization networks are stable and apply the main result to obtain bounds on their generalization ability. [sent-27, score-0.621]
20 2 Notations and Definitions $\mathcal{X}$ and $\mathcal{Y}$ being respectively an input and an output space, we consider a learning set $S = \{z_1 = (x_1, y_1), \ldots, z_m = (x_m, y_m)\}$. [sent-29, score-0.042]
21 A learning algorithm is a function A from $\mathcal{Z}^m$ into $\mathcal{Y}^{\mathcal{X}}$, where $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, mapping a learning set $S$ onto a function $f_S$ from $\mathcal{X}$ to $\mathcal{Y}$. [sent-35, score-0.158]
22 It is also assumed that the algorithm A is symmetric with respect to S, i.e. [sent-37, score-0.12]
23 for any permutation of the elements of S, $f_S$ yields the same result. [sent-39, score-0.052]
24 Furthermore, we assume that all functions are measurable and all sets are countable, which does not limit the interest of the results presented here. [sent-40, score-0.096]
25 The empirical error of a function $f$ measured on the training set $S$ is $R_m(f) = \frac{1}{m} \sum_{i=1}^{m} c(f, z_i)$, where $c : \mathcal{Y}^{\mathcal{X}} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ is a cost function. [sent-41, score-0.456]
26 The risk or generalization error can be written as $R(f) = \mathbb{E}_{z \sim D}\left[c(f, z)\right]$. The study we describe here intends to bound the difference between empirical and generalization error for specific algorithms. [sent-42, score-0.968]
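As a concrete, hedged illustration of these two quantities, the following sketch computes the empirical error $R_m(f)$ and a Monte Carlo estimate of the risk $R(f)$ under a squared-loss cost; the toy distribution, the predictor, and all function names are assumptions made for illustration and are not the paper's setup.

```python
import numpy as np

# Illustrative sketch (not the paper's setup): squared-loss cost c(f, z) = (f(x) - y)^2.
def cost(f, x, y):
    return (f(x) - y) ** 2

def empirical_error(f, S):
    """R_m(f) = (1/m) * sum_{i=1}^m c(f, z_i) over the training set S = [(x_i, y_i)]."""
    return float(np.mean([cost(f, x, y) for x, y in S]))

def risk_estimate(f, sample_z, n=100_000):
    """Monte Carlo estimate of R(f) = E_{z ~ D}[c(f, z)], given a sampler for D."""
    return float(np.mean([cost(f, *sample_z()) for _ in range(n)]))

# Toy distribution D: x uniform on [-1, 1], y = 2x + Gaussian noise.
rng = np.random.default_rng(0)
def sample_z():
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x + rng.normal(scale=0.1)

f = lambda x: 2.0 * x                  # a fixed predictor
S = [sample_z() for _ in range(50)]    # a training set of size m = 50
print(empirical_error(f, S), risk_estimate(f, sample_z))
```

On this toy problem both numbers should be close to the noise variance (here 0.01), since the predictor matches the regression function.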
27 More precisely, our goal is to bound, for any $\epsilon > 0$, the term (1). Usually, learning algorithms cannot output just any function in $\mathcal{Y}^{\mathcal{X}}$ but rather pick a function $f_S$ in a set $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$ representing the structure, the architecture, or the model. [sent-43, score-0.205]
28 Classical VC theory deals with structural properties and aims at bounding the following quantity: $P_{S \sim D^m}\left[\sup_{f \in \mathcal{F}} |R_m(f) - R(f)| > \epsilon\right]$ (2). This applies to any algorithm using $\mathcal{F}$ as a hypothesis space, and a bound on this quantity directly implies a similar bound on (1). [sent-44, score-0.762]
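Written out explicitly, the relation the text appeals to can be displayed as follows; this is a reconstruction from the surrounding definitions (quantity (1) is not reproduced verbatim in this extract), with (1) referring to the specific output $f_S$ of the algorithm and (2) to the uniform deviation over $\mathcal{F}$.

```latex
\underbrace{P_{S \sim D^m}\!\left[\, |R_m(f_S) - R(f_S)| > \epsilon \,\right]}_{\text{quantity (1)}}
\;\le\;
\underbrace{P_{S \sim D^m}\!\left[\, \sup_{f \in \mathcal{F}} |R_m(f) - R(f)| > \epsilon \,\right]}_{\text{quantity (2)}}
\qquad \text{since } f_S \in \mathcal{F} \text{ for every } S.
```

The inequality holds because the event in (1) is contained in the event in (2), which is why a bound on the uniform quantity transfers to any algorithm that searches $\mathcal{F}$.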
29 However, classical bounds require the VC dimension of $\mathcal{F}$ to be finite and do not use information about algorithmic properties. [sent-45, score-0.555]
30 For a given set $\mathcal{F}$, there exist many ways to search it, which may yield different performance. [sent-46, score-0.073]
31 For instance, multilayer perceptrons can be trained by a plain backpropagation algorithm or by backpropagation combined with a weight decay procedure. [sent-47, score-0.17]
32 The outcome of the algorithm belongs in both cases to the same set of functions, although the two procedures can perform differently. [sent-48, score-0.2]
33 VC theory was initially motivated by empirical risk minimization (ERM) in which case the uniform bounds on the quantity (2) give tight error bounds. [sent-49, score-0.979]
34 Intuitively, the empirical risk minimization principle relies on a uniform law of large numbers. [sent-50, score-0.508]
35 Because the minimizer of the empirical risk is not known in advance, it is necessary to study the difference between empirical and generalization error for all possible functions in $\mathcal{F}$. [sent-51, score-0.663]
36 If, instead of this minimum, we focus on the outcome of a learning algorithm A, we may know a little more about what kind of functions will be obtained. [sent-52, score-0.288]
37 This limits the possibilities and restricts the supremum over all the functions in $\mathcal{F}$ to the possible outcomes of the algorithm. [sent-53, score-0.271]
38 An algorithm which always outputs the null function does not need to be studied by a uniform law of large numbers. [sent-54, score-0.303]
39 Let's introduce a notation for modified training sets: if S denotes the initial training set, $S = \{z_1, \ldots, z_m\}$, [sent-55, score-0.158]
40 then $S^i$ denotes the training set after $z_i$ has been replaced by a different training example $z_i'$, that is $S^i = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m\}$. [sent-61, score-0.158]
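A minimal sketch of the two perturbed training sets used in the sequel (the function names are assumptions, not the paper's notation): the replace-one set $S^i$ defined above and the remove-one set used in the definition that follows.

```python
# Illustrative helpers (names are assumptions): perturbed versions of a training set S,
# represented here as a plain Python list of examples z_i = (x_i, y_i).
def replace_one(S, i, z_new):
    """The replace-one set S^i: a copy of S with the i-th example z_i swapped for z_new."""
    Si = list(S)
    Si[i] = z_new
    return Si

def remove_one(S, i):
    """The remove-one set: S with the i-th example deleted."""
    return S[:i] + S[i + 1:]
```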
41 Now, we define a notion of stability for regression. [sent-68, score-0.339]
42 Let $S = \{z_1, \ldots, z_m\}$ be a training set, $S^{\setminus i} = S \setminus \{z_i\}$ be the training set where instance $i$ has been removed, and A a symmetric algorithm. [sent-72, score-0.248]
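To make the notion concrete, here is a hedged sketch of how one could probe such stability empirically. Ridge regression is used only as a stand-in for a regularization network; the data, the regularization parameter lam, and the squared-loss cost are illustrative assumptions, and the quantity computed is simply the largest observed change in loss at a set of test points when one training example is removed.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Least squares with an L2 penalty (a stand-in for a regularization network)."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(d), X.T @ y)
    return lambda Xq: Xq @ w

def stability_probe(X, y, X_test, y_test, lam):
    """Largest observed change in squared loss over removed points i and test points."""
    f_S = ridge_fit(X, y, lam)
    base_loss = (f_S(X_test) - y_test) ** 2
    beta = 0.0
    for i in range(len(y)):
        X_i, y_i = np.delete(X, i, axis=0), np.delete(y, i)   # S with z_i removed
        f_Si = ridge_fit(X_i, y_i, lam)
        beta = max(beta, float(np.max(np.abs((f_Si(X_test) - y_test) ** 2 - base_loss))))
    return beta

# Toy data (an assumption for illustration).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(40, 3)); y = X @ w_true + 0.1 * rng.normal(size=40)
X_test = rng.normal(size=(200, 3)); y_test = X_test @ w_true

print(stability_probe(X, y, X_test, y_test, lam=0.1))
print(stability_probe(X, y, X_test, y_test, lam=10.0))
```

On such data one would expect the probe to shrink as lam grows, which is the qualitative behaviour (stronger regularization gives a more stable learner) that the paper exploits for regularization networks.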
wordName wordTfidf (topN-words)
[('bounds', 0.327), ('vc', 0.293), ('stability', 0.279), ('notations', 0.197), ('zl', 0.171), ('zi', 0.164), ('generalization', 0.163), ('cedex', 0.152), ('zm', 0.147), ('risk', 0.144), ('andre', 0.131), ('avenue', 0.131), ('bousquet', 0.131), ('empirical', 0.129), ('bound', 0.12), ('searches', 0.119), ('france', 0.11), ('refined', 0.11), ('covering', 0.11), ('hypothesis', 0.108), ('error', 0.101), ('quantity', 0.089), ('obtaining', 0.089), ('law', 0.089), ('uniform', 0.088), ('dimension', 0.088), ('si', 0.087), ('freund', 0.082), ('algorithmic', 0.082), ('outcome', 0.082), ('training', 0.079), ('algorithm', 0.074), ('search', 0.073), ('complexity', 0.073), ('regularization', 0.07), ('quantities', 0.07), ('bron', 0.066), ('dm', 0.066), ('eric', 0.066), ('erm', 0.066), ('supremum', 0.066), ('structural', 0.063), ('stable', 0.061), ('depend', 0.061), ('notion', 0.06), ('universite', 0.059), ('barnhill', 0.059), ('irm', 0.059), ('lyon', 0.059), ('possess', 0.059), ('restricts', 0.059), ('savannah', 0.059), ('searched', 0.059), ('sis', 0.059), ('wagner', 0.059), ('classical', 0.058), ('numbers', 0.058), ('minimization', 0.058), ('olivier', 0.055), ('ron', 0.055), ('possibilities', 0.055), ('novel', 0.052), ('advance', 0.052), ('aims', 0.052), ('ga', 0.052), ('null', 0.052), ('permutation', 0.052), ('explicitly', 0.051), ('backpropagation', 0.049), ('rm', 0.049), ('ej', 0.049), ('fr', 0.049), ('ez', 0.049), ('measurable', 0.049), ('sup', 0.049), ('difference', 0.047), ('functions', 0.047), ('minimum', 0.047), ('yd', 0.047), ('deals', 0.047), ('pac', 0.047), ('nice', 0.047), ('multilayer', 0.047), ('symmetric', 0.046), ('space', 0.045), ('introduces', 0.044), ('outcomes', 0.044), ('author', 0.044), ('bounding', 0.044), ('belongs', 0.044), ('technologies', 0.044), ('ym', 0.044), ('instance', 0.044), ('kind', 0.043), ('tried', 0.043), ('tight', 0.043), ('concerned', 0.043), ('pick', 0.043), ('posteriori', 0.043), ('learning', 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 21 nips-2000-Algorithmic Stability and Generalization Performance
Author: Olivier Bousquet, André Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
2 0.18447265 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
3 0.17415407 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
Author: Shie Mannor, Ron Meir
Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1
4 0.17329074 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of $\ell_1$-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes Let $(X, Y)$ be a random couple, where $X$ is an instance in a space $S$ and $Y \in \{-1, 1\}$ is a label. Let $\mathcal{G}$ be a set of functions from $S$ into $\mathbb{R}$. For $g \in \mathcal{G}$, $\mathrm{sign}(g(X))$ will be used as a predictor (a classifier) of the unknown label $Y$. If the distribution of $(X, Y)$ is unknown, then the choice of the predictor is based on the training data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that consists of $n$ i.i.d. copies of $(X, Y)$. The goal of learning is to find a predictor $g \in \mathcal{G}$ (based on the training data) whose generalization (classification) error $P\{Y g(X) \le 0\}$ is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let $(S, \mathcal{A}, P)$ be a probability space and let $\mathcal{F}$ be a class of measurable functions from $(S, \mathcal{A})$ into $\mathbb{R}$. Let $\{X_i\}$ be a sequence of i.i.d. random variables taking values in $(S, \mathcal{A})$ with common distribution $P$. Let $P_n$ be the empirical measure based on the sample $(X_1, \ldots, X_n)$.
5 0.16755049 94 nips-2000-On Reversing Jensen's Inequality
Author: Tony Jebara, Alex Pentland
Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
6 0.1457486 58 nips-2000-From Margin to Sparsity
7 0.13483235 37 nips-2000-Convergence of Large Margin Separable Linear Classification
8 0.11669762 54 nips-2000-Feature Selection for SVMs
9 0.11096144 13 nips-2000-A Tighter Bound for Graphical Models
10 0.096876532 120 nips-2000-Sparse Greedy Gaussian Process Regression
11 0.093414962 144 nips-2000-Vicinal Risk Minimization
12 0.090438306 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
13 0.083996862 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
14 0.077469796 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems
15 0.072780736 44 nips-2000-Efficient Learning of Linear Perceptrons
16 0.068735704 111 nips-2000-Regularized Winnow Methods
17 0.06844344 75 nips-2000-Large Scale Bayes Point Machines
18 0.064639747 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
19 0.063812628 18 nips-2000-Active Support Vector Machine Classification
20 0.06287881 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
topicId topicWeight
[(0, 0.23), (1, 0.184), (2, -0.074), (3, -0.055), (4, 0.044), (5, -0.243), (6, 0.074), (7, -0.066), (8, -0.004), (9, -0.031), (10, -0.183), (11, 0.07), (12, 0.085), (13, 0.027), (14, 0.023), (15, 0.028), (16, 0.227), (17, 0.088), (18, -0.084), (19, 0.054), (20, 0.001), (21, -0.052), (22, 0.062), (23, -0.091), (24, 0.024), (25, 0.035), (26, 0.18), (27, -0.08), (28, -0.061), (29, 0.035), (30, 0.051), (31, -0.013), (32, 0.047), (33, -0.157), (34, 0.194), (35, 0.052), (36, -0.082), (37, 0.025), (38, 0.078), (39, -0.061), (40, 0.034), (41, -0.059), (42, 0.033), (43, -0.06), (44, -0.134), (45, 0.051), (46, -0.055), (47, 0.048), (48, -0.135), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.96579403 21 nips-2000-Algorithmic Stability and Generalization Performance
Author: Olivier Bousquet, André Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
2 0.76610184 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of $\ell_1$-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes Let $(X, Y)$ be a random couple, where $X$ is an instance in a space $S$ and $Y \in \{-1, 1\}$ is a label. Let $\mathcal{G}$ be a set of functions from $S$ into $\mathbb{R}$. For $g \in \mathcal{G}$, $\mathrm{sign}(g(X))$ will be used as a predictor (a classifier) of the unknown label $Y$. If the distribution of $(X, Y)$ is unknown, then the choice of the predictor is based on the training data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that consists of $n$ i.i.d. copies of $(X, Y)$. The goal of learning is to find a predictor $g \in \mathcal{G}$ (based on the training data) whose generalization (classification) error $P\{Y g(X) \le 0\}$ is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let $(S, \mathcal{A}, P)$ be a probability space and let $\mathcal{F}$ be a class of measurable functions from $(S, \mathcal{A})$ into $\mathbb{R}$. Let $\{X_i\}$ be a sequence of i.i.d. random variables taking values in $(S, \mathcal{A})$ with common distribution $P$. Let $P_n$ be the empirical measure based on the sample $(X_1, \ldots, X_n)$.
3 0.58753473 94 nips-2000-On Reversing Jensen's Inequality
Author: Tony Jebara, Alex Pentland
Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
4 0.52532905 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
Author: Shie Mannor, Ron Meir
Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1
5 0.47927991 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
6 0.47706899 37 nips-2000-Convergence of Large Margin Separable Linear Classification
7 0.4469409 54 nips-2000-Feature Selection for SVMs
8 0.44548634 120 nips-2000-Sparse Greedy Gaussian Process Regression
9 0.41688251 58 nips-2000-From Margin to Sparsity
10 0.33273873 13 nips-2000-A Tighter Bound for Graphical Models
11 0.31961975 125 nips-2000-Stability and Noise in Biochemical Switches
12 0.3151364 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
13 0.30410537 18 nips-2000-Active Support Vector Machine Classification
14 0.30402842 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
15 0.29217756 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
16 0.28517273 144 nips-2000-Vicinal Risk Minimization
17 0.28330418 25 nips-2000-Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators
18 0.27534911 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
19 0.27498734 44 nips-2000-Efficient Learning of Linear Perceptrons
20 0.25922766 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
topicId topicWeight
[(10, 0.062), (17, 0.144), (32, 0.012), (33, 0.078), (55, 0.027), (62, 0.031), (67, 0.077), (76, 0.052), (79, 0.019), (90, 0.063), (96, 0.31), (97, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.80581176 21 nips-2000-Algorithmic Stability and Generalization Performance
Author: Olivier Bousquet, André Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
2 0.54221892 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
3 0.52952749 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
4 0.52454746 111 nips-2000-Regularized Winnow Methods
Author: Tong Zhang
Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.
5 0.52369648 37 nips-2000-Convergence of Large Margin Separable Linear Classification
Author: Tong Zhang
Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed
6 0.52345467 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
7 0.52135122 4 nips-2000-A Linear Programming Approach to Novelty Detection
8 0.51784021 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
9 0.5167039 133 nips-2000-The Kernel Gibbs Sampler
10 0.51466155 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
11 0.50907701 130 nips-2000-Text Classification using String Kernels
12 0.50755888 75 nips-2000-Large Scale Bayes Point Machines
13 0.50537086 94 nips-2000-On Reversing Jensen's Inequality
14 0.50522703 79 nips-2000-Learning Segmentation by Random Walks
15 0.50355411 52 nips-2000-Fast Training of Support Vector Classifiers
16 0.50347668 122 nips-2000-Sparse Representation for Gaussian Process Models
17 0.50077909 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
18 0.50011128 60 nips-2000-Gaussianization
19 0.49939898 36 nips-2000-Constrained Independent Component Analysis
20 0.49685374 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling