nips nips2000 nips2000-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 Kernel expansions with unlabeled examples. Martin Szummer, MIT AI Lab & CBCL, Cambridge, MA, szummer@ai.mit.edu [sent-1, score-0.657]
2 Abstract Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. [sent-5, score-1.32]
3 We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. [sent-6, score-0.886]
4 This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. [sent-7, score-0.787]
5 The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. [sent-8, score-0.924]
6 We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. [sent-9, score-0.579]
7 We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy. [sent-10, score-0.555]
8 1 Introduction In many modern classification problems such as text categorization, very few labeled examples are available but a large number of unlabeled examples can be readily acquired. [sent-11, score-1.348]
9 Various methods have recently been proposed to take advantage of unlabeled examples to improve classification performance. [sent-12, score-0.735]
10 Such methods include the EM algorithm with naive Bayes models for text classification [1], the co-training framework [2], transduction [3, 4], and maximum entropy discrimination [5]. [sent-13, score-0.435]
11 Unfortunately, the computational effort scales exponentially with the number of unlabeled examples for exact solutions in discriminative approaches such as transduction [3, 5]. [sent-15, score-0.908]
12 In this paper, we formulate a complementary discriminative approach to exploiting unlabeled examples, effectively by using them to expand the representation of examples. [sent-17, score-0.797]
13 This approach has several advantages, including the ability to represent the true Bayes optimal decision boundary and to make explicit use of the density over the examples. [sent-18, score-0.185]
14 We start by discussing the kernel density estimate and providing a smoothness condition, assuming labeled data only. [sent-21, score-0.608]
15 We subsequently introduce unlabeled data, define the expansion and formulate the EM algorithm for discriminative training. [sent-22, score-0.802]
16 In addition, we provide a purely discriminative version of the parameter estimation problem and formalize it as a maximum entropy discrimination problem. [sent-23, score-0.549]
17 2 Kernel density estimation and classification We start by assuming a large number of labeled examples D = \{(x_1, \hat{y}_1), [sent-25, score-0.701]
18 \ldots, (x_N, \hat{y}_N)\}, where \hat{y}_i \in \{-1, 1\} and x_i \in R^d. A joint kernel density estimate can be written as P(x, y) = \frac{1}{N} \sum_{i=1}^{N} \delta(y, \hat{y}_i) K(x, x_i) \quad (1), where \int K(x, x_i)\, d\mu(x) = 1 for each i. [sent-27, score-0.377]
19 With an appropriately chosen kernel K, a function of N, P(x, y) will be consistent in the sense of converging to the joint density as N \to \infty. [sent-28, score-0.269]
20 Given a fixed number of examples, the kernel functions K(x, x_i) may be viewed as conditional probabilities P(x|i), where i indexes the observed points. [sent-29, score-0.275]
21 The labels \hat{y}_i assigned to the sampled points x_i may themselves be noisy, and we incorporate P(y|i), a location-specific probability of labels. [sent-31, score-0.258]
22 The resulting joint density model is P(x, y) = \frac{1}{N} \sum_{i=1}^{N} P(y|i) P(x|i). [sent-32, score-0.083]
23 First, the number of aspects here equals the number of examples and the model is not suitable for clustering. [sent-39, score-0.172]
24 Second, we do not search for the probabilities P(x|i) (the kernels); instead, they are associated with each observed example and are merely adjusted in terms of scale (kernel width). [sent-40, score-0.115]
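To make this concrete, the model and the quantities used throughout the paper can be coded directly; the sketch below is only an illustration (the Gaussian kernel, the uniform 1/N weighting over examples, and all variable names are our own choices, not taken from the paper).

    import numpy as np

    def kernel_density(x, X, sigma):
        # P(x|i): a Gaussian kernel of width sigma centred at each stored example X[i]
        d = X.shape[1]
        sq = np.sum((X - x) ** 2, axis=1)
        return np.exp(-sq / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2) ** (d / 2.0)

    def membership(x, X, sigma):
        # P(i|x), proportional to P(x|i) under the uniform 1/N weighting of the examples
        k = kernel_density(x, X, sigma)
        return k / k.sum()

    def posterior(x, X, p_y1_given_i, sigma):
        # P(y=1|x) = sum_i P(y=1|i) P(i|x), with P(x, y) = (1/N) sum_i P(y|i) P(x|i)
        return float(membership(x, X, sigma) @ p_y1_given_i)

Here X stands for the N-by-d matrix of all labeled and unlabeled inputs and p_y1_given_i for the vector of P(y=1|i); both are hypothetical names used only in these sketches.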
25 The quality of the posterior probability depends both on how accurately P(y|i) are known and on the properties of the membership probabilities P(i|x) (always known), which must be relatively smooth. [sent-43, score-0.197]
26 Here we provide a simple condition on the membership probabilities P(i|x) so that any noise in the sampled labels for the available examples does not preclude accurate decisions. [sent-44, score-0.535]
27 In other words, we wish to ensure that the conditional probabilities P(y|x) can be evaluated accurately on the basis of the sampled estimate above. [sent-45, score-0.246]
28 Removing the label noise provides an alternative way of setting the width parameter \sigma of the Gaussian kernels. [sent-47, score-0.153]
29 The simple lemma below, obtained via standard large deviation methods, ties the appropriate choice of the kernel width \sigma to the squared norm of the membership probabilities P(i|x_j). [sent-48, score-0.689]
30 The lemma applies to our case by setting p_{i|k} = P(i|x_k) and w_i = P(y = 1|i) - P(y = -1|i), where \{\hat{y}_i\} are the sampled labels for the examples, and by noting that the sign of \sum_i w_i P(i|x) is the MAP decision rule from our model, P(y = 1|x) - P(y = -1|x). [sent-54, score-0.241]
31 The lemma states that as long as the membership probabilities have appropriately bounded squared norm, the noise in the labeling is inconsequential for the classification decisions. [sent-55, score-0.438]
32 The squared norm of P(i|x) is directly controlled by the kernel width \sigma^2, and thus the lemma ties the kernel width to the accuracy of estimating the conditional probabilities P(y|x). [sent-58, score-0.777]
33 Algorithms for adjusting the kernel width(s) on the basis of this will be presented in a longer version of the paper. [sent-59, score-0.248]
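Those width-adjustment algorithms are not given here; purely as an illustration of the quantity the lemma bounds, one could monitor the squared norms \sum_i P(i|x_j)^2 over a grid of candidate widths (our own diagnostic, not the authors' procedure).

    def membership_sq_norms(X, sigma):
        # rows of P hold the membership probabilities P(i|x_j); return sum_i P(i|x_j)^2 per j
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / (2.0 * sigma ** 2))   # normalisation constants cancel in P(i|x_j)
        P = K / K.sum(axis=1, keepdims=True)
        return np.sum(P ** 2, axis=1)

    for sigma in (0.1, 0.3, 1.0, 3.0):         # X as in the sketch above
        print(sigma, membership_sq_norms(X, sigma).max())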
34 3 The expansion and EM estimation A useful way to view the resulting kernel density estimate is that each example x is represented by a vector of membership probabilities P(i|x), i = 1, \ldots, N. [sent-60, score-0.633]
35 The examples in this new representation are classified by associating P(y|i) with each component and computing P(y|x) = \sum_i P(y|i) P(i|x). [sent-65, score-0.201]
36 An alternative approach to exploiting kernel density estimates in classification is given by [7]. [sent-66, score-0.408]
37 We now assume that we have labels for only a few examples, and our training data is \{(x_1, \hat{y}_1), \ldots, (x_L, \hat{y}_L), x_{L+1}, \ldots, x_N\}. [sent-67, score-0.1]
38 In other words, we can maximize the conditional log-likelihood \sum_{l=1}^{L} \log P(\hat{y}_l | x_l) = \sum_{l=1}^{L} \log \sum_{i=1}^{N} P(\hat{y}_l | i) P(i|x_l) \quad (2), where the first summation is only over the labeled examples and L \ll N. [sent-77, score-0.476]
39 Since P(i|x_l) are fixed, this objective function is jointly concave in the free parameters and has a unique maximum value. [sent-78, score-0.093]
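The extracted text does not spell out the E- and M-steps; reading Eq. (2) as a mixture over the latent index i, they take the following form (a sketch under that reading, with our own matrix layout and names).

    def em_estimate(P_i_given_xl, y_labeled, n_iter=100):
        # P_i_given_xl: L-by-N matrix whose rows are P(i|x_l) for the labeled examples
        # y_labeled:    length-L vector of labels in {-1, +1}
        L, N = P_i_given_xl.shape
        q = np.full(N, 0.5)                              # current estimates of P(y=1|i)
        for _ in range(n_iter):
            # E-step: responsibilities P(i|x_l, y_l), proportional to P(y_l|i) P(i|x_l)
            lik = np.where(y_labeled[:, None] == 1, q[None, :], 1.0 - q[None, :])
            resp = lik * P_i_given_xl
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: P(y=1|i) = responsibility mass from positive examples / total mass at i
            pos = resp[y_labeled == 1].sum(axis=0)
            tot = resp.sum(axis=0)
            q = np.where(tot > 0, pos / np.maximum(tot, 1e-12), 0.5)
        return q

Because the P(i|x_l) stay fixed, the concavity noted above means these iterations approach the global maximum of (2).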
40 This procedure may have to be adjusted in cases where the overall frequency of different labels in the (labeled) training set deviates significantly from uniform. [sent-85, score-0.136]
41 The discriminative formulation suggests that EM will provide reasonable parameter estimates P(y|i) for classification purposes. [sent-88, score-0.407]
42 The quality of the solution, as well as the potential for overfitting, is contingent on the smoothness of the kernels or, equivalently, the smoothness of the membership probabilities P(i|x). [sent-89, score-0.374]
43 Actual classification decisions for unlabeled examples x_i (included in the expansion) need to be made on the basis of P(y|x_i) and not on the basis of P(y|i), which function as parameters. [sent-91, score-0.831]
44 4 Discriminative estimation An alternative discriminative formulation is also possible, one that is sensitive to the decision boundary rather than to the probability values associated with the labels. [sent-92, score-0.486]
45 The weights w_i = P(y = 1|i) - P(y = -1|i) cannot be treated as unconstrained parameters; they must satisfy w_i \in [-1, 1] so long as we wish to maintain the kernel density interpretation. [sent-98, score-0.241]
46 Instead, we employ the maximum entropy discrimination (MED) framework [5] and rely on the relation w_i = E\{y_i\} = \sum_{y_i = \pm 1} y_i P(y_i) to estimate the distribution P(y) over all the labels y = [y_1, \ldots, y_N]. [sent-100, score-0.488]
47-48 We can show that in this case the maximum entropy solution factors across the examples, P(y_1, \ldots, y_N) = \prod_i P_i(y_i), and we can formulate the estimation problem directly in terms of the marginals P_i(y_i). [sent-102, score-0.381] [sent-105, score-0.114]
49 The maximum entropy formalism encodes the principle that label assignments P_i(y_i) for the examples should remain uninformative to the extent possible given the classification objective. [sent-106, score-0.52]
50-51 More formally, given a set of L labeled examples (x_1, \hat{y}_1), \ldots, (x_L, \hat{y}_L), we maximize \sum_i H(y_i) - C \sum_l \xi_l subject to the classification constraints (4), where H(y_i) is the entropy of y_i relative to the marginal P_i(y_i). [sent-107, score-0.438] [sent-110, score-0.316]
52 Here \gamma specifies the target separation (\gamma \in [0, 1]) and the slack variables \xi_l \geq 0 permit deviations from the target to ensure that a solution always exists. [sent-111, score-0.078]
53 The advantage of this formulation is that effort is spent only on those training examples whose classification is uncertain. [sent-114, score-0.39]
54 5 Discussion of the expanded representation The kernel expansion enables us to represent the Bayes optimal decision boundary provided that the kernel density estimate is sufficiently accurate. [sent-117, score-0.661]
55 With this representation, the EM and MED algorithms actually estimate decision boundaries that are sensitive to the density P(x). [sent-118, score-0.203]
56 For example, labeled points in high-density regions will influence the boundary more than labeled points in low-density regions. [sent-119, score-0.355]
57 The boundary will partly follow the density, but unlike in unsupervised methods, will adhere strongly to the labeled points. [sent-120, score-0.324]
58 Moreover, our estimation techniques limit the effect of outliers, as all points have a bounded weight w_i \in [-1, 1] (spurious unlabeled points do not adversely affect the boundary). [sent-121, score-0.571]
59 As we impose smoothness constraints on the membership probabilities P(i|x), we also guarantee that the capacity of the resulting classifier need not increase with the number of unlabeled examples (in the fat-shattering sense). [sent-122, score-0.906]
60 Also, in the context of the maximum entropy formulation, if a point is not helpful for the classification constraints, then entropy is maximized for P_i(y_i) = 0.5, implying w_i = 0, and the point has no effect on the boundary. [sent-123, score-0.521]
61 If we dispense with the conditional probability interpretation of the kernels K, we are free to choose them from a more general class of functions. [sent-125, score-0.139]
62 For example, the kernels no longer have to integrate to 1. [sent-126, score-0.107]
63 An expansion of x in terms of these kernels can still be meaningful; as a special case, when linear kernels are chosen, the expansion reduces to weighting distances between points by the covariance of the data. [sent-127, score-0.335]
64 This also enables, e.g., support vector machines to take advantage of unlabeled data: we can expand the inputs x in terms of kernels G from labeled and unlabeled points as in \phi(x) = [G(x, x_1), \ldots, G(x, x_N)]. [sent-131, score-1.254]
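A minimal sketch of that use, under our own choices (a Gaussian form for G, scikit-learn's LinearSVC as the linear classifier, hypothetical arrays X_all, X_labeled, y_labeled, X_test, and no particular scaling of the expansion, which the extracted text does not fix):

    from sklearn.svm import LinearSVC

    def expand(X_query, X_all, sigma):
        # phi(x) = [G(x, x_1), ..., G(x, x_N)] over all labeled and unlabeled points x_1..x_N
        sq = np.sum((X_query[:, None, :] - X_all[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    svm = LinearSVC().fit(expand(X_labeled, X_all, sigma=1.0), y_labeled)
    predictions = svm.predict(expand(X_test, X_all, sigma=1.0))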
65 Figure 1a) demonstrates that this is not the case and, instead, the test classification error approaches the limiting asymptotic rate exponentially fast. [sent-136, score-0.262]
66 The problem considered was a DNA splice site classification problem with 500 examples for which d = 100. [sent-137, score-0.289]
67 Varying sizes of random subsets were labeled and all the examples were used in the expansion as unlabeled examples. [sent-138, score-0.971]
68 The error rate was computed on the basis of the remaining 500 - L examples without labels, where L denotes the number of labeled examples. [sent-139, score-0.521]
69 The exponential rate of convergence towards the limiting rate is evidenced by the linear trend in the semilog plot of Figure 1a). [sent-141, score-0.182]
70 The mean test errors shown in Figure 1b) indicate that the purely discriminative training (MED) can contribute substantially to the accuracy. [sent-142, score-0.327]
71 The kernel width in these experiments was simply fixed to the median distance to the 5th nearest neighbor from the opposite class. [sent-143, score-0.23]
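That heuristic is straightforward to reproduce; the sketch below follows our reading of the sentence (for each labeled point, take the distance to its 5th nearest neighbour among the opposite-class labeled points, then the median over points; it assumes at least five labeled points in each class).

    def width_heuristic(X_labeled, y_labeled, k=5):
        dists = []
        for x, y in zip(X_labeled, y_labeled):
            opp = X_labeled[y_labeled != y]                     # opposite-class labeled points
            dists.append(np.sort(np.linalg.norm(opp - x, axis=1))[k - 1])
        return np.median(dists)                                 # used as the kernel width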
72 Results from other methods of choosing the kernel width (the squared norm, adaptive) will be discussed in the longer version of the paper. [sent-144, score-0.307]
73 Another concern is perhaps that the formulation is valid only in cases where we have a large number of unlabeled examples. [sent-145, score-0.55]
74 In principle, the method could deteriorate rapidly once the kernel density estimate can no longer be assumed to give reasonable estimates. [sent-146, score-0.328]
75 The problem here is to classify DNA microarray experiments on the basis of the leukemia types that the tissues used in the array experiments corresponded to. [sent-148, score-0.158]
76 The number of examples available was 38 for training and 34 for testing. [sent-150, score-0.202]
77 We included all examples as unlabeled points in the expansion and randomly selected subsets of labeled training examples, and measured the performance only on the test examples (which were of slightly different type and hence more appropriate for assessing generalization). [sent-151, score-1.244]
78 Figure 2 shows rapid convergence for EM and the discriminative MED formulation. [sent-152, score-0.218]
79 The "asymptotic" level here corresponds to about one classification error among the 34 test examples. [sent-153, score-0.145]
80 7 Conclusion We have provided a complementary framework for exploiting unlabeled examples in discriminative classification problems. [sent-155, score-1.035]
81 The framework involves a combination of the ideas of kernel density estimation and representational expansion of input vectors. [sent-156, score-0.391]
82 Figure 1: a) A semilog plot of the test error rate for the EM formulation less the asymptotic rate as a function of labeled examples. [sent-157, score-0.974]
83 The linear trend in the figure implies that the error rate approaches the asymptotic error exponentially fast. [sent-158, score-0.115]
84 b) The mean test errors for EM, MED and SVM as a function of the number of labeled examples. [sent-159, score-0.328]
85 Figure 2: The mean test errors for the leukemia classification problem as a function of the number of randomly chosen labeled examples. [sent-161, score-0.761]
86 A simple EM algorithm is sufficient for finding globally optimal parameter estimates, but we have shown that a purely discriminative formulation can yield substantially better results within the framework. [sent-163, score-0.337]
87 Possible extensions include using the kernel expansions with transductive algorithms that enforce margin constraints also for the unlabeled examples [5]. [sent-164, score-0.928]
88 (1999) Transductive inference for text classification using support vector machines. [sent-186, score-0.171]
89 (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. [sent-208, score-0.712]
90 A Maximum entropy solution The unique solution to the maximum entropy estimation problem is found by introducing Lagrange multipliers \{\lambda_l\} for the classification constraints. [sent-210, score-0.614]
91 The multipliers satisfy \lambda_l \in [0, C], where the lower bound comes from the inequality constraints and the upper bound from the linear margin penalties being minimized. [sent-211, score-0.155]
92 To represent the solution and find the optimal setting of \lambda_l we must evaluate the partition function Z(\lambda) = e^{-\gamma \sum_l \lambda_l} \sum_{y} \prod_{i=1}^{N} e^{y_i \sum_l \hat{y}_l \lambda_l P(i|x_l)} \quad (5) \; = \; e^{-\gamma \sum_l \lambda_l} \prod_{i=1}^{N} \left( e^{\sum_l \hat{y}_l \lambda_l P(i|x_l)} + e^{-\sum_l \hat{y}_l \lambda_l P(i|x_l)} \right) \quad (6), which normalizes the maximum entropy distribution. [sent-212, score-0.252]
93 Minimizing the jointly convex log-partition function \log Z(\lambda) with respect to the Lagrange multipliers leads to the optimal setting \{\lambda_l^*\}. [sent-214, score-0.076]
94 This optimization is readily done via an axis-parallel line search. [sent-215, score-0.085]
95 The required gradients are given by \partial \log Z(\lambda) / \partial \lambda_k = -\gamma + \hat{y}_k \sum_{i=1}^{N} \tanh\left( \sum_l \hat{y}_l \lambda_l P(i|x_l) \right) P(i|x_k) \quad (7) \; = \; -\gamma + \hat{y}_k \sum_{i=1}^{N} E_{P_i^*}\{y_i\} P(i|x_k) \quad (8) (this is essentially the classification constraint). [sent-218, score-0.145]
96 The expectation is taken with respect to the maximum entropy distribution P^*(y_1, \ldots, y_N). [sent-219, score-0.179]
97 We can identify these from above, w_i^* = \tanh\left( \sum_l \hat{y}_l \lambda_l^* P(i|x_l) \right), and they are readily evaluated. [sent-228, score-0.144]
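Putting the appendix together, the optimisation can be sketched as follows; a fixed-step projected coordinate update stands in for the axis-parallel line search, and the values of gamma, C, the step size and the names are our own choices.

    def med_fit(P_i_given_xl, y_labeled, gamma=0.1, C=1.0, n_sweeps=50, step=0.1):
        # minimise log Z(lambda) of (5)-(6) subject to 0 <= lambda_l <= C
        L, N = P_i_given_xl.shape
        lam = np.zeros(L)
        for _ in range(n_sweeps):
            for k in range(L):
                s = P_i_given_xl.T @ (y_labeled * lam)   # s_i = sum_l y_l lambda_l P(i|x_l)
                grad = -gamma + y_labeled[k] * np.sum(np.tanh(s) * P_i_given_xl[k])  # (7)-(8)
                lam[k] = np.clip(lam[k] - step * grad, 0.0, C)
        w = np.tanh(P_i_given_xl.T @ (y_labeled * lam))  # w_i = E{y_i}, as identified above
        return w, lam

New points are then classified by the sign of \sum_i w_i P(i|x), as in the decision rule discussed with the lemma above.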
98 Often the numbers of positive and negative training labels are imbalanced. [sent-230, score-0.1]
99 The MED formulation (analogously to SVMs) can be adjusted by defining the margin penalties as C^+ \sum_{l: \hat{y}_l = 1} \xi_l + C^- \sum_{l: \hat{y}_l = -1} \xi_l, where, for example, L^+ C^+ = L^- C^- equalizes the mean penalties. [sent-231, score-0.182]
100 The coefficients C^+ and C^- can also be modified adaptively during the estimation process to balance the rate of misclassification errors across the two classes. [sent-232, score-0.132]
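For instance (our own two lines; the common budget C is an assumption), with L^+ positive and L^- negative labels the condition L^+ C^+ = L^- C^- can be met by splitting a shared budget:

    L_pos, L_neg = np.sum(y_labeled == 1), np.sum(y_labeled == -1)
    C_pos = C * L_neg / (L_pos + L_neg)    # then L_pos * C_pos == L_neg * C_neg
    C_neg = C * L_pos / (L_pos + L_neg)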
wordName wordTfidf (topN-words)
[('unlabeled', 0.446), ('yli', 0.373), ('labeled', 0.266), ('discriminative', 0.218), ('ilx', 0.207), ('med', 0.174), ('examples', 0.172), ('kernel', 0.158), ('ilxl', 0.149), ('yi', 0.122), ('membership', 0.118), ('classification', 0.117), ('entropy', 0.116), ('em', 0.11), ('labels', 0.1), ('xli', 0.1), ('wi', 0.092), ('ih', 0.091), ('expansion', 0.087), ('density', 0.083), ('ylx', 0.08), ('probabilities', 0.079), ('ilxd', 0.075), ('pill', 0.075), ('szummer', 0.075), ('width', 0.072), ('formulation', 0.072), ('kernels', 0.065), ('ilxk', 0.064), ('maximum', 0.063), ('estimation', 0.063), ('norm', 0.061), ('lemma', 0.061), ('boundary', 0.058), ('smoothness', 0.056), ('text', 0.054), ('label', 0.052), ('readily', 0.052), ('pi', 0.051), ('formulate', 0.051), ('exploiting', 0.05), ('asymptotic', 0.05), ('dyadic', 0.05), ('leukemia', 0.05), ('pilk', 0.05), ('semilog', 0.05), ('wip', 0.05), ('basis', 0.048), ('el', 0.048), ('purely', 0.047), ('multipliers', 0.046), ('estimate', 0.045), ('iii', 0.044), ('decision', 0.044), ('mitchell', 0.043), ('ly', 0.043), ('normalizes', 0.043), ('ties', 0.043), ('transduction', 0.043), ('transductive', 0.043), ('longer', 0.042), ('included', 0.042), ('discrimination', 0.042), ('modern', 0.039), ('expansions', 0.039), ('penalties', 0.039), ('dna', 0.039), ('conditional', 0.038), ('helpful', 0.037), ('adjusted', 0.036), ('sampled', 0.036), ('implying', 0.036), ('squared', 0.035), ('rate', 0.035), ('constraints', 0.035), ('margin', 0.035), ('errors', 0.034), ('tommi', 0.034), ('xi', 0.033), ('via', 0.033), ('xl', 0.032), ('complementary', 0.032), ('concern', 0.032), ('limiting', 0.032), ('points', 0.031), ('bayes', 0.031), ('sensitive', 0.031), ('solution', 0.03), ('jointly', 0.03), ('trend', 0.03), ('array', 0.03), ('available', 0.03), ('classified', 0.029), ('effort', 0.029), ('rr', 0.029), ('appropriately', 0.028), ('lagrange', 0.028), ('tanh', 0.028), ('expanded', 0.028), ('test', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
2 0.17934217 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
3 0.13165052 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.13124689 94 nips-2000-On Reversing Jensen's Inequality
Author: Tony Jebara, Alex Pentland
Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
5 0.12640984 133 nips-2000-The Kernel Gibbs Sampler
Author: Thore Graepel, Ralf Herbrich
Abstract: We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise. 1
6 0.12185201 121 nips-2000-Sparse Kernel Principal Component Analysis
7 0.11895523 130 nips-2000-Text Classification using String Kernels
8 0.10941444 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
9 0.10674567 75 nips-2000-Large Scale Bayes Point Machines
10 0.10555323 110 nips-2000-Regularization with Dot-Product Kernels
11 0.098909549 54 nips-2000-Feature Selection for SVMs
12 0.0916664 58 nips-2000-From Margin to Sparsity
13 0.08626692 37 nips-2000-Convergence of Large Margin Separable Linear Classification
14 0.083325006 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
15 0.082701102 4 nips-2000-A Linear Programming Approach to Novelty Detection
16 0.077404998 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
17 0.077167675 120 nips-2000-Sparse Greedy Gaussian Process Regression
18 0.071550429 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
19 0.071359843 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
20 0.07082954 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
topicId topicWeight
[(0, 0.261), (1, 0.167), (2, -0.024), (3, 0.05), (4, -0.023), (5, 0.101), (6, -0.043), (7, 0.0), (8, -0.007), (9, 0.039), (10, -0.063), (11, -0.024), (12, 0.095), (13, -0.011), (14, 0.075), (15, 0.021), (16, 0.087), (17, -0.067), (18, 0.015), (19, 0.032), (20, -0.216), (21, 0.159), (22, 0.229), (23, 0.108), (24, 0.04), (25, -0.107), (26, 0.064), (27, -0.111), (28, 0.077), (29, -0.108), (30, 0.058), (31, 0.041), (32, -0.017), (33, -0.039), (34, -0.19), (35, 0.044), (36, 0.037), (37, 0.059), (38, -0.031), (39, 0.12), (40, -0.003), (41, 0.0), (42, -0.117), (43, -0.079), (44, 0.055), (45, 0.05), (46, 0.015), (47, -0.099), (48, 0.095), (49, -0.245)]
simIndex simValue paperId paperTitle
same-paper 1 0.95309603 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
2 0.68475968 144 nips-2000-Vicinal Risk Minimization
Author: Olivier Chapelle, Jason Weston, Léon Bottou, Vladimir Vapnik
Abstract: The Vicinal Risk Minimization principle establishes a bridge between generative models and methods derived from the Structural Risk Minimization Principle such as Support Vector Machines or Statistical Regularization. We explain how VRM provides a framework which integrates a number of existing algorithms, such as Parzen windows, Support Vector Machines, Ridge Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show how the approach implies new algorithms for solving problems usually associated with generative models. New algorithms are described for dealing with pattern recognition problems with very different pattern distributions and dealing with unlabeled data. Preliminary empirical results are presented.
3 0.4335852 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.41755185 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
Author: Sebastian Mika, Gunnar R채tsch, Klaus-Robert M체ller
Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both, KFD and the proposed sparse KFD, can be understood in an unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.
5 0.4050841 110 nips-2000-Regularization with Dot-Product Kernels
Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson
Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1
6 0.40269232 133 nips-2000-The Kernel Gibbs Sampler
7 0.39565358 54 nips-2000-Feature Selection for SVMs
8 0.38021427 94 nips-2000-On Reversing Jensen's Inequality
9 0.37426907 111 nips-2000-Regularized Winnow Methods
10 0.37048566 130 nips-2000-Text Classification using String Kernels
12 0.34542015 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
13 0.3157022 37 nips-2000-Convergence of Large Margin Separable Linear Classification
14 0.31389058 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
15 0.30976903 121 nips-2000-Sparse Kernel Principal Component Analysis
16 0.30560344 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
17 0.29894599 85 nips-2000-Mixtures of Gaussian Processes
18 0.28710443 75 nips-2000-Large Scale Bayes Point Machines
19 0.28324872 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
20 0.27265424 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
topicId topicWeight
[(10, 0.043), (17, 0.139), (32, 0.034), (33, 0.059), (55, 0.023), (62, 0.052), (65, 0.028), (67, 0.073), (75, 0.025), (76, 0.053), (79, 0.02), (81, 0.015), (89, 0.228), (90, 0.083), (91, 0.017), (97, 0.022)]
simIndex simValue paperId paperTitle
1 0.91877586 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
Author: Jürgen Tchorz, Michael Kleinschmidt, Birger Kollmeier
Abstract: A novel noise suppression scheme for speech signals is proposed which is based on a neurophysiologically-motivated estimation of the local signal-to-noise ratio (SNR) in different frequency channels. For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which represent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modulation frequencies in higher stages of the mammalian auditory system. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppression is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speakerindependent digit recognition experiments and compared to noise suppression by Spectral Subtraction. 1
same-paper 2 0.84394819 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
3 0.64642322 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p \geq 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p-1)/(\alpha^2 \gamma^2)) corrections to separate the data with p-norm margin larger than (1 - \alpha)\gamma, where \gamma is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVM training algorithms.
4 0.63846475 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modeling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets.
5 0.63797605 79 nips-2000-Learning Segmentation by Random Walks
Author: Marina Meila, Jianbo Shi
Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
6 0.6360575 111 nips-2000-Regularized Winnow Methods
7 0.62974358 52 nips-2000-Fast Training of Support Vector Classifiers
8 0.62662965 21 nips-2000-Algorithmic Stability and Generalization Performance
9 0.62639463 133 nips-2000-The Kernel Gibbs Sampler
10 0.62639284 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
11 0.62295896 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
12 0.62130201 122 nips-2000-Sparse Representation for Gaussian Process Models
13 0.62080663 37 nips-2000-Convergence of Large Margin Separable Linear Classification
14 0.6186074 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
15 0.61710012 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
16 0.614977 130 nips-2000-Text Classification using String Kernels
17 0.61211103 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
18 0.61113137 75 nips-2000-Large Scale Bayes Point Machines
19 0.61105818 22 nips-2000-Algorithms for Non-negative Matrix Factorization
20 0.61009914 60 nips-2000-Gaussianization