nips nips2000 nips2000-119 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of the ℓ1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin-type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {−1, 1} is a label. Let G be a set of functions from S into ℝ. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ..., (X_n, Y_n), which consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into ℝ. Let {X_i} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let P_n be the empirical measure based on the sample (X_1, ...
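The two quantities this abstract revolves around, the empirical margin distribution P_n{Y f(X) ≤ δ} and the Gaussian complexity G_n(F), are straightforward to estimate numerically for a finite base class and its convex hull. The Python sketch below is not code from the paper: the toy data, the stump base class, the placeholder constant C, and the schematic form of the assembled bound P_n{Y f(X) ≤ δ} + C·G_n/δ + √(log(2/α)/(2n)) are illustrative assumptions; the precise margin-type inequalities are the theorems referred to in [5].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x in R^2, label y = sign of the first coordinate plus noise.
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))

# Finite base class H: random decision stumps, outputs in {-1, +1}.
n_base = 50
dirs = rng.normal(size=(n_base, d))
thresh = rng.normal(scale=0.5, size=n_base)
H = np.sign(X @ dirs.T - thresh)            # shape (n, n_base)

# A boosting-style combined classifier: convex weights over the base class.
w = rng.dirichlet(np.ones(n_base))
f = H @ w                                   # real-valued outputs in [-1, 1]
margins = y * f

def empirical_margin_dist(margins, delta):
    """P_n{ y f(x) <= delta }: fraction of training examples with margin <= delta."""
    return float(np.mean(margins <= delta))

def gaussian_complexity_conv_hull(H, n_mc=2000):
    """Monte Carlo estimate of G_n for the convex hull of the base class:
    G_n = E sup_f | n^{-1} sum_i g_i f(x_i) |, with g_i i.i.d. standard normal.
    Over the convex hull the supremum is attained at a base classifier,
    so it reduces to a max of |.| over the columns of H."""
    n = H.shape[0]
    G = rng.normal(size=(n_mc, n))
    vals = np.abs(G @ H) / n                # shape (n_mc, n_base)
    return float(vals.max(axis=1).mean())

Gn = gaussian_complexity_conv_hull(H)
alpha, C = 0.05, 2.0                        # C is a placeholder constant, NOT from the paper

for delta in (0.05, 0.1, 0.2, 0.5):
    bound = (empirical_margin_dist(margins, delta)
             + C * Gn / delta
             + np.sqrt(np.log(2 / alpha) / (2 * n)))
    print(f"delta={delta:.2f}  margin mass={empirical_margin_dist(margins, delta):.3f}  "
          f"schematic bound={bound:.3f}")
```

The only structural fact the sketch relies on is that, for a convex hull of finitely many base classifiers, the supremum in the definition of G_n is attained at a vertex, so the Monte Carlo estimate is a simple column-wise maximum.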
Reference: text
sentIndex sentText sentNum sentScore
1 Some new bounds on the generalization error of combined classifiers Vladimir Koltchinskii Department of Mathematics and Statistics University of New Mexico Albuquerque, NM 87131-1141 vlad@math. [sent-1, score-0.505]
2 edu Abstract In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. [sent-7, score-0.793]
3 The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. [sent-8, score-0.918]
4 As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. [sent-9, score-0.426]
5 We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of the ℓ1-norms of the weights of neurons. [sent-10, score-0.552]
6 Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. [sent-11, score-0.532]
7 1 Introduction and margin-type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {−1, 1} is a label. [sent-12, score-0.606]
8 For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. [sent-14, score-0.333]
9 If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ... [sent-15, score-0.243]
10 The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. [sent-22, score-0.462]
11 We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. [sent-23, score-0.975]
12 We omit all the proofs and refer an interested reader to [5]. [sent-24, score-0.264]
13 Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into ℝ. [sent-25, score-0.342]
14 random variables taking values in (S, A) with common distribution P. [sent-29, score-0.14]
15 Let P_n be the empirical measure based on the sample (X_1, ..., X_n), P_n := n⁻¹ Σ_{i=1}^n δ_{X_i}, where δ_x denotes the probability distribution concentrated at x. [sent-30, score-0.124]
16 In what follows, ℓ^∞(F) denotes the Banach space of uniformly bounded real-valued functions on F with the sup-norm ‖Y‖_F. [sent-36, score-0.289]
17 standard normal random variables, independent of {X_i}. We will call n ↦ G_n(F) the Gaussian complexity function of the class F. [sent-42, score-0.311]
18 [11]) various upper bounds on such quantities as G_n(F) in terms of entropies, VC-dimensions, etc. [sent-45, score-0.323]
19 We give below a bound in terms of margin cost functions (compare to [6, 7]) and Gaussian complexities. [sent-46, score-0.377]
20 Let us consider a special family of cost functions. [sent-48, score-0.123]
21 Assume that φ is a fixed nonincreasing Lipschitz function from ℝ into ℝ such that φ(x) ≥ (1 + sgn(−x))/2 for all x ∈ ℝ. [sent-49, score-0.089]
22 One can easily observe that L(φ(·/δ)) ≤ L(φ)δ⁻¹. [sent-50, score-0.038]
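The objects defined in the sentences above admit compact closed forms. The display below is a reconstruction of the standard definitions from the extracted text (empirical measure, Gaussian complexity, a canonical margin cost function), not a verbatim quote of the paper; in particular the normalisation of G_n(F) and the specific ramp cost are the usual conventions and should be checked against [5].

```latex
% Reconstruction of the standard definitions used above; normalisations are
% the usual ones and should be checked against [5].
\[
  P_n := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i},
  \qquad
  G_n(\mathcal{F}) := \mathbb{E}\,\sup_{f\in\mathcal{F}}
    \Bigl|\frac{1}{n}\sum_{i=1}^{n} g_i f(X_i)\Bigr|,
\]
where $g_1,\dots,g_n$ are i.i.d. standard normal random variables independent
of $\{X_i\}$. A canonical cost function satisfying the conditions above is the
ramp $\varphi(x) := \min\{1,\max(0,\,1-x)\}$, which is nonincreasing,
Lipschitz with $L(\varphi)=1$, and dominates $(1+\operatorname{sgn}(-x))/2$.
The final observation follows from
\[
  \bigl|\varphi(x/\delta)-\varphi(y/\delta)\bigr|
  \le L(\varphi)\,\Bigl|\frac{x}{\delta}-\frac{y}{\delta}\Bigr|
  = \frac{L(\varphi)}{\delta}\,|x-y|,
  \qquad\text{hence}\qquad
  L\bigl(\varphi(\cdot/\delta)\bigr)\le \frac{L(\varphi)}{\delta}.
\]
```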
wordName wordTfidf (topN-words)
[('albuquerque', 0.311), ('mexico', 0.311), ('bartlett', 0.245), ('schapire', 0.211), ('predictor', 0.211), ('bounds', 0.203), ('nm', 0.199), ('bounding', 0.182), ('cp', 0.182), ('gn', 0.179), ('freund', 0.168), ('pn', 0.162), ('lipschitz', 0.162), ('generalization', 0.148), ('margin', 0.144), ('inequalities', 0.141), ('boosting', 0.133), ('mathematics', 0.121), ('let', 0.113), ('ir', 0.112), ('oo', 0.108), ('functional', 0.101), ('class', 0.099), ('complexity', 0.099), ('fernando', 0.089), ('vlad', 0.089), ('yg', 0.089), ('classes', 0.088), ('copies', 0.081), ('omit', 0.081), ('entropies', 0.081), ('error', 0.075), ('jp', 0.075), ('classifier', 0.074), ('vladimir', 0.07), ('cpu', 0.07), ('unknown', 0.068), ('tighter', 0.066), ('measurable', 0.066), ('functions', 0.064), ('gi', 0.063), ('sand', 0.063), ('sgn', 0.061), ('couple', 0.061), ('xd', 0.061), ('dp', 0.058), ('papers', 0.058), ('non', 0.058), ('proofs', 0.058), ('type', 0.058), ('valued', 0.056), ('reader', 0.056), ('gaussian', 0.055), ('label', 0.054), ('improve', 0.053), ('cost', 0.052), ('department', 0.051), ('norm', 0.05), ('empirical', 0.05), ('substantially', 0.049), ('electrical', 0.049), ('hypotheses', 0.047), ('quantities', 0.047), ('statistics', 0.047), ('sequence', 0.046), ('terms', 0.045), ('give', 0.045), ('classifiers', 0.045), ('lee', 0.045), ('random', 0.044), ('uniformly', 0.043), ('denotes', 0.042), ('sign', 0.041), ('family', 0.041), ('interested', 0.04), ('normal', 0.038), ('observe', 0.038), ('literature', 0.037), ('applying', 0.036), ('prove', 0.036), ('assumptions', 0.035), ('develop', 0.035), ('combined', 0.034), ('processes', 0.034), ('bounded', 0.034), ('taking', 0.032), ('variables', 0.032), ('xn', 0.032), ('distribution', 0.032), ('call', 0.031), ('increasing', 0.031), ('us', 0.03), ('engineering', 0.03), ('instance', 0.03), ('refer', 0.029), ('various', 0.028), ('theorem', 0.028), ('find', 0.028), ('allow', 0.027), ('bound', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of the ℓ1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin-type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {−1, 1} is a label. Let G be a set of functions from S into ℝ. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ..., (X_n, Y_n), which consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into ℝ. Let {X_i} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let P_n be the empirical measure based on the sample (X_1, ...
2 0.23673254 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
Author: Shie Mannor, Ron Meir
Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1
3 0.17329074 21 nips-2000-Algorithmic Stability and Generalization Performance
Author: Olivier Bousquet, Andr茅 Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
4 0.15395647 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
5 0.14694262 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems
Author: Richard S. Zemel, Toniann Pitassi
Abstract: In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods. The aim of boosting algorithms is to
6 0.1339345 58 nips-2000-From Margin to Sparsity
7 0.12096026 94 nips-2000-On Reversing Jensen's Inequality
8 0.10792703 37 nips-2000-Convergence of Large Margin Separable Linear Classification
9 0.098462045 120 nips-2000-Sparse Greedy Gaussian Process Regression
10 0.089792341 54 nips-2000-Feature Selection for SVMs
11 0.088845581 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
12 0.076851845 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
13 0.075520441 75 nips-2000-Large Scale Bayes Point Machines
14 0.067844376 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
15 0.065609761 18 nips-2000-Active Support Vector Machine Classification
16 0.064010069 133 nips-2000-The Kernel Gibbs Sampler
17 0.061901607 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
18 0.054242536 13 nips-2000-A Tighter Bound for Graphical Models
19 0.05234177 4 nips-2000-A Linear Programming Approach to Novelty Detection
20 0.050252788 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
topicId topicWeight
[(0, 0.196), (1, 0.212), (2, -0.069), (3, -0.032), (4, 0.011), (5, -0.235), (6, 0.055), (7, -0.014), (8, 0.017), (9, -0.067), (10, -0.223), (11, 0.092), (12, 0.094), (13, 0.026), (14, 0.015), (15, 0.068), (16, 0.228), (17, 0.199), (18, 0.02), (19, -0.093), (20, 0.003), (21, -0.044), (22, 0.049), (23, -0.135), (24, 0.039), (25, 0.095), (26, -0.024), (27, -0.02), (28, -0.0), (29, 0.145), (30, -0.086), (31, 0.025), (32, 0.042), (33, -0.025), (34, 0.094), (35, -0.011), (36, 0.03), (37, -0.038), (38, 0.016), (39, -0.012), (40, -0.03), (41, -0.035), (42, 0.06), (43, -0.03), (44, -0.016), (45, -0.013), (46, 0.023), (47, 0.042), (48, -0.045), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.97073835 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of the ℓ1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin-type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {−1, 1} is a label. Let G be a set of functions from S into ℝ. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ..., (X_n, Y_n), which consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into ℝ. Let {X_i} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let P_n be the empirical measure based on the sample (X_1, ...
2 0.78281736 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
Author: Shie Mannor, Ron Meir
Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1
3 0.75951672 21 nips-2000-Algorithmic Stability and Generalization Performance
Author: Olivier Bousquet, André Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
4 0.63885552 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems
Author: Richard S. Zemel, Toniann Pitassi
Abstract: In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods. The aim of boosting algorithms is to
5 0.48096746 94 nips-2000-On Reversing Jensen's Inequality
Author: Tony Jebara, Alex Pentland
Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
6 0.46943983 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
7 0.42128786 37 nips-2000-Convergence of Large Margin Separable Linear Classification
8 0.41596043 120 nips-2000-Sparse Greedy Gaussian Process Regression
9 0.39327249 58 nips-2000-From Margin to Sparsity
10 0.35149503 54 nips-2000-Feature Selection for SVMs
11 0.33698466 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
12 0.31100291 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
13 0.30352148 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
14 0.26502147 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
15 0.26309338 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
16 0.25196061 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
17 0.24218139 13 nips-2000-A Tighter Bound for Graphical Models
18 0.21979818 75 nips-2000-Large Scale Bayes Point Machines
19 0.2191021 138 nips-2000-The Use of Classifiers in Sequential Inference
20 0.20591405 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
topicId topicWeight
[(10, 0.067), (13, 0.303), (17, 0.136), (32, 0.023), (33, 0.077), (55, 0.015), (62, 0.03), (65, 0.027), (67, 0.046), (76, 0.074), (90, 0.102), (97, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.81623125 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of the ℓ1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin-type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {−1, 1} is a label. Let G be a set of functions from S into ℝ. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X_1, Y_1), ..., (X_n, Y_n), which consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into ℝ. Let {X_i} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let P_n be the empirical measure based on the sample (X_1, ...
2 0.55284619 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMA_p (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 − α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMA_p avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMA_p to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMA_p are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMA_p is quite faster and easier to implement than standard SVM training algorithms.
3 0.55236226 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
4 0.53864557 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
5 0.53349555 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modeling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution, i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are x_i (with i = 1, ..., m) and the mapping is x_i → φ(x_i); then in the span of {φ(x_i)} we can expand a vector w = Σ_j α_j φ(x_j). Hence we can define separating hyperplanes in feature space by w · φ(x_i) + b = 0. We will refer to w · φ(x_i) + b as the margin, which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: f(z) = sgn(w · φ(z) + b) (1), where z is a new data point. The data appears in the form of an inner product in feature space, so we can implicitly define feature space by our choice of kernel function: K(x_i, x_j) = φ(x_i) · φ(x_j) (2). A number of choices for the kernel are possible, for example, RBF kernels: K(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²)) (3). With the given kernel the decision function is therefore given by: f(z) = sgn(Σ_j α_j K(z, x_j) + b) (4). One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3, 12]. This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11]. The effect of outliers is reduced by using slack variables e_i to allow for datapoints outside the sphere, and the task is to minimise the volume of the sphere and the number of datapoints outside, i.e. min_{R,a,e} [R² + λ Σ_i e_i] s.t. (x_i − a) · (x_i − a) ≤ R² + e_i, e_i ≥ 0 (5). Since the data appears in the form of inner products, kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7].
Suppose we restrict our attention to RBF kernels (3); then the data lie on the surface of a hypersphere in feature space, since φ(x) · φ(x) = K(x, x) = 1. The objective is therefore to separate off the surface region containing data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin, with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of: W(α) = (1/2) Σ_{i,j=1}^m α_i α_j K(x_i, x_j) s.t. 0 ≤ α_i ≤ C, Σ_{i=1}^m α_i = 1 (6). However, the origin plays a special role in this model. As the authors point out [9], this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution, we instead try to attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, f(z) = 0, of some nonlinear function. In feature space, f(z) = Σ_i α_i K(z, x_i) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints, with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Σ_i f(x_i). This is achieved by minimising: Σ_{i=1}^m (Σ_{j=1}^m α_j K(x_i, x_j) + b) (7) subject to: Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ 0 (8), Σ_{i=1}^m α_i = 1, α_i ≥ 0 (9). The bias b is just treated as an additional parameter in the minimisation process, though unrestricted in sign. The added constraints (9) on α bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with α ≠ 0 exists. Other constraints on the class of functions are possible, e.g. ‖α‖_1 = 1 with no restriction on the sign of α_i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin, in analogy to the usual approach used with support vector machines. In this case we minimise: Σ_{i=1}^m (Σ_{j=1}^m α_j K(x_i, x_j) + b) + λ Σ_{i=1}^m e_i (10) subject to: Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ −e_i, e_i ≥ 0 (11) and constraints (9). The parameter λ controls the extent of margin errors (larger λ means fewer outliers are ignored: λ → ∞ corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory.
Column generation algorithms incrementally add and drop columns, each corresponding to a single kernel function, until optimality is reached. Such approaches have been successfully applied to other support vector problems [6, 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and σ
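Since the abstract above spells out the soft-margin LP (constraints (8)-(11)), a small numerical sketch is easy to set up with an off-the-shelf LP solver. The code below is not the authors' implementation: the kernel width, the value of λ, and the exact form of the objective (the paper's equations (7)/(10)) are assumptions made from the verbal description; it only illustrates the structure of the optimisation.

```python
# Sketch of the soft-margin LP novelty detector described above (assumed objective:
# sum_i f(x_i) + lambda * sum_i e_i with f(x) = sum_j alpha_j K(x, x_j) + b).
import numpy as np
from scipy.optimize import linprog

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lp_novelty_detector(X, lam=10.0, sigma=1.0):
    m = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    # Decision variables: x = [alpha_1..alpha_m, b, e_1..e_m]
    c = np.concatenate([K.sum(axis=0), [m], lam * np.ones(m)])
    # Constraint (11):  K alpha + b*1 + e >= 0   <=>   -K alpha - b*1 - e <= 0
    A_ub = np.hstack([-K, -np.ones((m, 1)), -np.eye(m)])
    b_ub = np.zeros(m)
    # Normalisation constraint (9): sum_j alpha_j = 1, alpha_j >= 0
    A_eq = np.concatenate([np.ones(m), [0.0], np.zeros(m)])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    alpha, b = res.x[:m], res.x[m]

    def decision(Z):
        # f(z) = sum_j alpha_j K(z, x_j) + b ; points with f(z) < 0 are flagged as novel
        return rbf_kernel(Z, X, sigma) @ alpha + b

    return decision

# Toy usage: train on one Gaussian cluster, score a far-away point.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 2))
f = lp_novelty_detector(X_train, lam=10.0, sigma=1.0)
print(f(np.array([[0.0, 0.0], [6.0, 6.0]])))   # first point ~normal (>= 0), second novel (< 0)
```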
6 0.53313404 111 nips-2000-Regularized Winnow Methods
7 0.52624142 21 nips-2000-Algorithmic Stability and Generalization Performance
8 0.52530295 52 nips-2000-Fast Training of Support Vector Classifiers
9 0.5219022 75 nips-2000-Large Scale Bayes Point Machines
10 0.5202871 37 nips-2000-Convergence of Large Margin Separable Linear Classification
11 0.51886237 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
12 0.51720339 133 nips-2000-The Kernel Gibbs Sampler
13 0.51054823 36 nips-2000-Constrained Independent Component Analysis
14 0.51034701 130 nips-2000-Text Classification using String Kernels
15 0.50759304 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
16 0.50490004 122 nips-2000-Sparse Representation for Gaussian Process Models
17 0.50331736 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
18 0.50175625 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
19 0.50163609 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
20 0.49979499 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets