nips nips2000 nips2000-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. [sent-5, score-0.534]
2 We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. [sent-6, score-0.21]
3 We then show how the approach implies new algorithms for solving problems usually associated with generative models. [sent-7, score-0.394]
4 New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. [sent-8, score-0.878]
5 1 Introduction Structural Risk Minimization (SRM) in a learning system can be achieved using constraints on the parameter vectors, using regularization terms in the cost function, or using Support Vector Machines (SVM). [sent-10, score-0.041]
6 All these principles have led to well-established learning algorithms. [sent-11, score-0.087]
7 It is often said, however, that some problems are best addressed by generative models. [sent-12, score-0.389]
8 We may for instance have a few labeled patterns and a large number of unlabeled patterns. [sent-14, score-0.389]
9 Intuition suggests that these unlabeled patterns carry useful information. [sent-15, score-0.428]
10 The second problem is that of discriminating classes with very different pattern distributions. [sent-16, score-0.181]
11 This also occurs often in recognition systems that reject invalid patterns by defining a garbage class for grouping all ambiguous or unrecognizable cases. [sent-18, score-0.435]
12 Although there are successful non-generative approaches (Schuurmans and Southey, 2000; Drucker, Wu and Vapnik, 1999), the generative framework is undeniably appealing. [sent-19, score-0.395]
13 Recent results (Jaakkola, Meila and Jebara, 2000) even define generative models that contain SVM as special cases. [sent-20, score-0.285]
14 This paper discusses the Vicinal Risk Minimization (VRM) principle, summarily introduced in (Vapnik, 1999). [sent-21, score-0.064]
15 This principle was independently hinted at by Tong and Koller (2000), with a useful generative interpretation. [sent-22, score-0.434]
16 In particular, they proved that SVM are a limiting case of their Restricted Bayesian Classifiers. [sent-23, score-0.105]
17 We extend Tong and Koller's result by showing that VRM subsumes several well-known techniques such as Ridge Regression (Hoerl and Kennard, 1970), Constrained Logistic Classifier, or Tangent Prop (Simard et al. [sent-24, score-0.064]
18 We then go on to show how VRM naturally leads to simple algorithms that can deal with problems for which one would have formerly considered purely generative models. [sent-26, score-0.602]
19 We provide algorithms and preliminary empirical results for dealing with unlabeled data or recognizing classes with very different pattern distributions. [sent-27, score-0.78]
20 2 Vicinal Risk Minimization The learning problem can be formulated as the search for the function $f \in \mathcal{F}$ that minimizes the expectation of a given loss $\ell(f(x), y)$. [sent-28, score-0.102]
21 $R(f) = \int \ell(f(x), y) \, dP(x, y)$ (1) In the classification framework, $y$ takes values $\pm 1$ and $\ell(f(x), y)$ is a step function such as $1 - \mathrm{sign}(y f(x))$, whereas in the regression framework, $y$ is a real number and $\ell(f(x), y)$ is commonly the squared error $(f(x) - y)^2$. [sent-29, score-0.144]
22 The expectation (1) cannot be computed since the distribution $P(x, y)$ is unknown. [sent-30, score-0.057]
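The sketch below is not from the paper; it is a minimal Monte Carlo illustration of how a vicinal risk estimate replaces the point masses of the empirical risk with local vicinity distributions, assuming Gaussian vicinities of width sigma around each training point and the 0-1 loss $\frac{1}{2}(1 - \mathrm{sign}(y f(x)))$. All names (vicinal_risk, n_draws, the toy data) are invented for the example; with sigma = 0 the estimate reduces to the ordinary empirical risk.

```python
import numpy as np

def vicinal_risk(f, X, y, sigma=0.1, n_draws=20, seed=0):
    """Monte Carlo estimate of a vicinal risk with Gaussian vicinities.

    Each training point x_i is replaced by a vicinity N(x_i, sigma^2 I);
    the 0-1 loss of the classifier f is averaged over draws from these
    vicinities instead of over the training points themselves.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    noise = rng.normal(scale=sigma, size=(n_draws, n, d))   # sigma = 0 gives no perturbation
    X_vic = X[None, :, :] + noise                            # shape (n_draws, n, d)
    preds = f(X_vic.reshape(-1, d)).reshape(n_draws, n)
    losses = 0.5 * (1.0 - np.sign(y[None, :] * preds))       # (1 - sign(y f(x))) / 2
    return float(losses.mean())

# Toy usage: a fixed linear classifier on 2-D Gaussian data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.2 * rng.normal(size=200))
f = lambda Z: Z @ np.array([1.0, 0.0])                       # classify on the first coordinate
print("empirical risk estimate:", vicinal_risk(f, X, y, sigma=0.0))
print("vicinal risk estimate  :", vicinal_risk(f, X, y, sigma=0.3))
```

In the full VRM formulation the vicinity distribution and its width are part of the model choice (Gaussian vicinities recover Parzen-window-style estimates); here sigma is just a fixed parameter for illustration.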
wordName wordTfidf (topN-words)
[('vicinal', 0.355), ('vrm', 0.355), ('generative', 0.285), ('unlabeled', 0.277), ('risk', 0.251), ('koller', 0.171), ('tong', 0.171), ('minimization', 0.169), ('dealing', 0.163), ('ridge', 0.153), ('vapnik', 0.144), ('chapelle', 0.139), ('principle', 0.113), ('regression', 0.104), ('logistic', 0.104), ('svm', 0.102), ('preliminary', 0.086), ('drucker', 0.076), ('simard', 0.076), ('bridge', 0.076), ('srm', 0.076), ('weston', 0.076), ('reject', 0.076), ('rithms', 0.076), ('vlad', 0.076), ('structural', 0.073), ('naturally', 0.073), ('patterns', 0.072), ('framework', 0.072), ('leon', 0.069), ('labs', 0.069), ('parzen', 0.069), ('tangent', 0.069), ('barnhill', 0.069), ('savannah', 0.069), ('discriminating', 0.069), ('machines', 0.067), ('bottou', 0.064), ('olivier', 0.064), ('schultz', 0.064), ('invalid', 0.064), ('said', 0.064), ('jebara', 0.064), ('discusses', 0.064), ('meila', 0.064), ('subsumes', 0.064), ('yf', 0.064), ('integrates', 0.064), ('pattern', 0.062), ('wu', 0.06), ('establishes', 0.06), ('red', 0.06), ('vladimir', 0.06), ('ga', 0.06), ('minimisation', 0.057), ('grouping', 0.057), ('limiting', 0.057), ('expectation', 0.057), ('problems', 0.056), ('constrained', 0.056), ('bank', 0.054), ('schuurmans', 0.054), ('support', 0.054), ('algorithms', 0.053), ('established', 0.052), ('classes', 0.05), ('dp', 0.05), ('ambiguous', 0.05), ('drive', 0.05), ('proved', 0.048), ('addressed', 0.048), ('recognizing', 0.046), ('missing', 0.046), ('intuition', 0.046), ('formulated', 0.045), ('windows', 0.043), ('carry', 0.043), ('empirical', 0.043), ('purely', 0.042), ('recognition', 0.042), ('arises', 0.041), ('regularization', 0.041), ('defining', 0.041), ('labeled', 0.04), ('commonly', 0.04), ('jaakkola', 0.04), ('successful', 0.038), ('classifiers', 0.038), ('nj', 0.038), ('explain', 0.038), ('formally', 0.037), ('existing', 0.036), ('detection', 0.036), ('useful', 0.036), ('sign', 0.035), ('principles', 0.035), ('situation', 0.034), ('usa', 0.034), ('go', 0.033), ('occurs', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
2 0.17934217 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
3 0.11546011 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
Author: Sepp Hochreiter, Michael Mozer
Abstract: The goal of many unsupervised learning procedures is to bring two probability distributions into alignment. Generative models such as Gaussian mixtures and Boltzmann machines can be cast in this light, as can recoding models such as ICA and projection pursuit. We propose a novel sample-based error measure for these classes of models, which applies even in situations where maximum likelihood (ML) and probability density estimation-based formulations cannot be applied, e.g., models that are nonlinear or have intractable posteriors. Furthermore, our sample-based error measure avoids the difficulties of approximating a density function. We prove that with an unconstrained model, (1) our approach converges on the correct solution as the number of samples goes to infinity, and (2) the expected solution of our approach in the generative framework is the ML solution. Finally, we evaluate our approach via simulations of linear and nonlinear models on mixture of Gaussians and ICA problems. The experiments show the broad applicability and generality of our approach. 1
5 0.10263168 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of fewer training examples than input dimensions, sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
6 0.09790545 54 nips-2000-Feature Selection for SVMs
7 0.093414962 21 nips-2000-Algorithmic Stability and Generalization Performance
8 0.069503225 75 nips-2000-Large Scale Bayes Point Machines
9 0.067931429 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
10 0.061941452 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
11 0.057404224 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
12 0.05352848 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
13 0.052966364 12 nips-2000-A Support Vector Method for Clustering
14 0.050897285 58 nips-2000-From Margin to Sparsity
15 0.044326171 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
16 0.043745838 133 nips-2000-The Kernel Gibbs Sampler
17 0.042888336 37 nips-2000-Convergence of Large Margin Separable Linear Classification
18 0.042831607 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
19 0.042803146 110 nips-2000-Regularization with Dot-Product Kernels
20 0.042650029 111 nips-2000-Regularized Winnow Methods
topicId topicWeight
[(0, 0.152), (1, 0.115), (2, -0.013), (3, 0.01), (4, 0.009), (5, -0.053), (6, 0.026), (7, -0.018), (8, 0.022), (9, 0.059), (10, -0.053), (11, -0.071), (12, 0.195), (13, -0.053), (14, 0.092), (15, -0.038), (16, 0.068), (17, -0.097), (18, -0.02), (19, 0.054), (20, -0.313), (21, 0.042), (22, 0.27), (23, 0.025), (24, -0.021), (25, -0.133), (26, 0.143), (27, 0.007), (28, 0.09), (29, -0.144), (30, 0.187), (31, 0.029), (32, -0.027), (33, -0.178), (34, -0.185), (35, 0.108), (36, 0.034), (37, 0.01), (38, -0.024), (39, -0.035), (40, -0.057), (41, 0.088), (42, -0.106), (43, -0.063), (44, -0.075), (45, 0.022), (46, -0.21), (47, -0.038), (48, 0.183), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.97673142 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
2 0.59696418 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
Author: Sepp Hochreiter, Michael Mozer
Abstract: The goal of many unsupervised learning procedures is to bring two probability distributions into alignment. Generative models such as Gaussian mixtures and Boltzmann machines can be cast in this light, as can recoding models such as ICA and projection pursuit. We propose a novel sample-based error measure for these classes of models, which applies even in situations where maximum likelihood (ML) and probability density estimation-based formulations cannot be applied, e.g., models that are nonlinear or have intractable posteriors. Furthermore, our sample-based error measure avoids the difficulties of approximating a density function. We prove that with an unconstrained model, (1) our approach converges on the correct solution as the number of samples goes to infinity, and (2) the expected solution of our approach in the generative framework is the ML solution. Finally, we evaluate our approach via simulations of linear and nonlinear models on mixture of Gaussians and ICA problems. The experiments show the broad applicability and generality of our approach. 1
4 0.38471514 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
Author: Simon Tong, Daphne Koller
Abstract: Bayesian networks are graphical representations of probability distributions. In virtually all of the work on learning these networks, the assumption is that we are presented with a data set consisting of randomly generated instances from the underlying distribution. In many situations, however, we also have the option of active learning, where we have the possibility of guiding the sampling process by querying for certain types of samples. This paper addresses the problem of estimating the parameters of Bayesian networks in an active learning setting. We provide a theoretical framework for this problem, and an algorithm that chooses which active learning queries to generate based on the model learned so far. We present experimental results showing that our active learning algorithm can significantly reduce the need for training data in many situations.
5 0.32609233 44 nips-2000-Efficient Learning of Linear Perceptrons
Author: Shai Ben-David, Hans-Ulrich Simon
Abstract: We consider the existence of efficient algorithms for learning the class of half-spaces in $\mathbb{R}^n$ in the agnostic learning model (i.e., making no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best agreement half-space over an input sample - is NP-hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is $\mu$-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the $\mu$-margins of its separating hyper-plane are disregarded. We prove crisp computational complexity results with respect to this success measure: On one hand, for every positive $\mu$, there exist efficient (poly-time) $\mu$-margin successful learning algorithms. On the other hand, we prove that unless P=NP, there is no algorithm that runs in time polynomial in the sample size and in $1/\mu$ that is $\mu$-margin successful for all $\mu > 0$. 1
6 0.30730435 54 nips-2000-Feature Selection for SVMs
7 0.30366927 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
8 0.26165414 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
9 0.25108975 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
10 0.23565601 21 nips-2000-Algorithmic Stability and Generalization Performance
11 0.22099163 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
12 0.20694929 75 nips-2000-Large Scale Bayes Point Machines
13 0.20341954 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
14 0.19875431 18 nips-2000-Active Support Vector Machine Classification
15 0.19664444 133 nips-2000-The Kernel Gibbs Sampler
16 0.19000667 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
17 0.18431041 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems
18 0.18321584 138 nips-2000-The Use of Classifiers in Sequential Inference
19 0.16619101 12 nips-2000-A Support Vector Method for Clustering
20 0.16395909 51 nips-2000-Factored Semi-Tied Covariance Matrices
topicId topicWeight
[(10, 0.703), (17, 0.078), (33, 0.027), (62, 0.017), (76, 0.04), (90, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.96271902 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
2 0.86006963 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
Author: Dirk Ormoneit, Peter W. Glynn
Abstract: Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem. 1
3 0.65943438 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
Author: Zhaoping Li, Peter Dayan
Abstract: Stimulus arrays are inevitably presented at different positions on the retina in visual tasks, even those that nominally require fixation. In particular, this applies to many perceptual learning tasks. We show that perceptual inference or discrimination in the face of positional variance has a structurally different quality from inference about fixed position stimuli, involving a particular, quadratic, non-linearity rather than a purely linear discrimination. We show the advantage taking this non-linearity into account has for discrimination, and suggest it as a role for recurrent connections in area V1, by demonstrating the superior discrimination performance of a recurrent network. We propose that learning the feedforward and recurrent neural connections for these tasks corresponds to the fast and slow components of learning observed in perceptual learning tasks.
4 0.59144282 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
Author: Ralf Herbrich, Thore Graepel
Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of fewer training examples than input dimensions, sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1
5 0.33879378 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove the margin type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of $\ell_1$-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes Let (X, Y) be a random couple, where X is an instance in a space S and $Y \in \{-1, 1\}$ is a label. Let $\mathcal{G}$ be a set of functions from S into $\mathbb{R}$. For $g \in \mathcal{G}$, $\mathrm{sign}(g(X))$ will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor $g \in \mathcal{G}$ (based on the training data) whose generalization (classification) error $\mathbb{P}\{Y g(X) \le 0\}$ is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let $(S, \mathcal{A}, P)$ be a probability space and let $\mathcal{F}$ be a class of measurable functions from $(S, \mathcal{A})$ into $\mathbb{R}$. Let $\{X_i\}$ be a sequence of i.i.d. random variables taking values in $(S, \mathcal{A})$ with common distribution P. Let $P_n$ be the empirical measure based on the sample (X_1,
6 0.33862752 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
7 0.32177886 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
8 0.31845531 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
9 0.30585265 49 nips-2000-Explaining Away in Weight Space
10 0.30580196 75 nips-2000-Large Scale Bayes Point Machines
11 0.30478176 34 nips-2000-Competition and Arbors in Ocular Dominance
12 0.29790699 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
13 0.29525802 147 nips-2000-Who Does What? A Novel Algorithm to Determine Function Localization
14 0.29497585 133 nips-2000-The Kernel Gibbs Sampler
15 0.2862584 116 nips-2000-Sex with Support Vector Machines
16 0.28230861 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
17 0.27989617 130 nips-2000-Text Classification using String Kernels
18 0.27766731 4 nips-2000-A Linear Programming Approach to Novelty Detection
19 0.27762103 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
20 0.27744329 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes