nips nips2000 nips2000-5 knowledge-graph by maker-knowledge-mining

5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm


Source: pdf

Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller

Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). [sent-4, score-0.032]

2 A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. [sent-5, score-0.454]
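
To make the mathematical programming view concrete, the block below sketches one common way KFD can be posed as a constrained optimization problem over the expansion coefficients α, a bias b, and output slacks ξ. This is a hedged reconstruction from the surrounding description, not necessarily the paper's exact program; the constant C and the regularizer P(α) are assumptions.

```latex
% Sketch of a mathematical-programming form of KFD (assumed, not quoted from the paper).
% K is the kernel matrix, y the label vector, \mathbf{1}_k the indicator vector of class k.
% The equality constraints force the class-wise means of \xi to zero; choosing
% P(\alpha) = \|\alpha\|_1 would encourage a sparse coefficient vector.
\begin{align}
  \min_{\alpha,\, b,\, \xi} \quad & \|\xi\|^2 + C\, P(\alpha) \\
  \text{s.t.} \quad & K\alpha + b\,\mathbf{1} = y + \xi, \\
                    & \mathbf{1}_k^\top \xi = 0, \qquad k = 1, 2 .
\end{align}
```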

3 We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. [sent-6, score-0.126]

4 From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. [sent-8, score-0.101]

5 1 Introduction Recent years have seen an enormous interest in kernel-based classification algorithms, primarily in Support Vector Machines (SVM) [2]. [sent-10, score-0.117]

6 The success of SVMs seems to be triggered by (i) their good generalization performance, (ii) the existence of a unique solution, and (iii) the strong theoretical background: structural risk minimization [12], supporting the good empirical results. [sent-11, score-0.396]

7 One of the key ingredients responsible for this success is the use of Mercer kernels, allowing for nonlinear decision surfaces which might even incorporate some prior knowledge about the problem at hand. [sent-12, score-0.319]

8 For our purpose, a Mercer kernel can be defined as a function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ for which some (nonlinear) mapping $\Phi : \mathbb{R}^n \to \mathcal{F}$ into a feature space $\mathcal{F}$ exists, such that $k(x, y) = (\Phi(x) \cdot \Phi(y))$. [sent-13, score-0.222]
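
As a minimal numerical illustration of this definition (my own example, not code from the paper): for a degree-2 polynomial kernel the feature map Φ can be written down explicitly, and the kernel value equals the dot product of the mapped points. The function names below are assumptions made for the sketch.

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit feature map Phi with k(x, y) = Phi(x) . Phi(y) for the kernel above:
    all pairwise products x_i * x_j, i.e. the flattened outer product."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# The kernel evaluates a dot product in feature space without ever building Phi explicitly.
assert np.isclose(poly2_kernel(x, y),
                  np.dot(poly2_feature_map(x), poly2_feature_map(y)))
```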

9 Clearly, the use of such kernel functions is not limited to SVMs. [sent-15, score-0.162]

10 The interpretation as a dot-product in another space makes it particularly easy to develop new algorithms: take any (usually) linear method and reformulate it using training samples only in dot-products, which are then replaced by the kernel. [sent-16, score-0.223]
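
The following sketch illustrates that recipe on the simplest linear rule that can be written purely in dot products, namely comparing a test point to the two class means; replacing the dot products by an RBF kernel turns it into a nonlinear classifier. This is an illustrative example of the kernel trick, not the KFD algorithm itself, and all names are my own.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mean_difference_decision(X_train, y_train, X_test, kernel=rbf_kernel):
    """Linear rule f(x) = x . (m_plus - m_minus), rewritten so the training data
    enter only through dot products, which are then replaced by the kernel."""
    K = kernel(X_test, X_train)          # pairwise kernel evaluations
    pos, neg = (y_train == 1), (y_train == -1)
    # Mean kernel similarity to each class replaces the dot product with the class mean.
    return K[:, pos].mean(axis=1) - K[:, neg].mean(axis=1)

# Toy usage: two Gaussian blobs labelled +1 / -1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 0.5, (20, 2)), rng.normal(-1, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
print(np.sign(mean_difference_decision(X, y, X)))   # mostly recovers y
```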

11 In this article we consider algorithmic ideas for KFD. [sent-18, score-0.11]

12 Interestingly, KFD - although exhibiting similarly good performance to SVMs - has no explicit concept of a margin. [sent-19, score-0.183]

13 This is noteworthy since the margin is often regarded as the explanation for the good generalization of SVMs. [sent-20, score-0.27]

14 We will give an alternative formulation of KFD which makes the difference between the two techniques explicit and allows a better understanding of the algorithms. [sent-21, score-0.226]

15 Another advantage of the new formulation is that we can derive more efficient algorithms for optimizing KFDs that have e.g. ... [sent-22, score-0.157]

16 2 A Review of Kernel Fisher Discriminant The idea of the KFD is to solve the problem of Fisher's linear discriminant in a kernel feature space $\mathcal{F}$, thereby yielding a nonlinear discriminant in the input space. [sent-25, score-0.803]

17 Let $\{x_1, \ldots, x_\ell\}$ be our training sample and $y \in \{-1, 1\}^\ell$ be the vector of corresponding labels. [sent-30, score-0.097]

18 Furthermore define $\mathbf{1} \in \mathbb{R}^\ell$ as the vector of all ones, $\mathbf{1}_1, \mathbf{1}_2 \in \mathbb{R}^\ell$ as binary (0,1) vectors corresponding to the class labels, and let $I$, $I_1$, and $I_2$ be appropriate index sets over $\{1, \ldots, \ell\}$ and the two classes, respectively (with $\ell_i = |I_i|$). [sent-31, score-0.111]

19 In the linear case, Fisher's discriminant is computed by maximizing the coefficient $J(w) = (w^\top S_B w)/(w^\top S_W w)$ of between and within class variance, i.e. [sent-32, score-0.327]

20 $S_B = (m_2 - m_1)(m_2 - m_1)^\top$ and $S_W = \sum_{k=1,2} \sum_{i \in I_k} (x_i - m_k)(x_i - m_k)^\top$, where $m_k$ denotes the sample mean for class $k$. [sent-34, score-0.159]
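
For the linear case just described, a compact sketch of the textbook solution follows: the Rayleigh quotient $J(w)$ is maximized, up to scaling, by $w \propto S_W^{-1}(m_2 - m_1)$. This is standard Fisher discriminant analysis, not code from the paper; the small ridge term is an assumption added for numerical stability.

```python
import numpy as np

def fisher_linear_discriminant(X, y, reg=1e-6):
    """Return the direction w maximizing J(w) = (w^T S_B w) / (w^T S_W w)
    for labels y in {-1, +1}, using S_B = (m2 - m1)(m2 - m1)^T and the
    within-class scatter S_W. A small ridge `reg` keeps S_W invertible."""
    X1, X2 = X[y == -1], X[y == +1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W + reg * np.eye(X.shape[1]), m2 - m1)
    return w / np.linalg.norm(w)

# Toy usage on two overlapping Gaussian classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
print(fisher_linear_discriminant(X, y))
```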

21 To solve the problem in a kernel feature space $\mathcal{F}$ one needs a formulation which makes use of the training samples only in terms of dot-products. [sent-35, score-0.391]

22 One first shows [4] that there exists an expansion for $w \in \mathcal{F}$ in terms of mapped training patterns, i.e. [sent-36, score-0.132]

23 $w = \sum_{i=1}^{\ell} \alpha_i \Phi(x_i)$ (1). Using some straightforward algebra, the optimization problem for the KFD can then be ... [sent-38, score-0.079]
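
Once $w$ is expanded as in (1), the whole optimization can be carried out through the kernel matrix alone. The sketch below follows the usual KFD derivation, solving $\alpha \propto (N + \mu I)^{-1}(M_2 - M_1)$ with $M_k$ the class means and $N$ the within-class scatter expressed in α-coordinates; it is a reconstruction from the standard derivation rather than the paper's code, and the kernel choice and ridge $\mu$ are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_fisher_discriminant(X, y, gamma=1.0, mu=1e-3):
    """Expansion coefficients alpha for w = sum_i alpha_i Phi(x_i).

    Maximizes the kernelized Rayleigh quotient (alpha^T M alpha)/(alpha^T N alpha)
    with M = (M2 - M1)(M2 - M1)^T and N the within-class kernel scatter;
    the ridge mu*I is a common regularization of N (an assumption here)."""
    K = rbf_kernel(X, X, gamma)
    ell = len(y)
    N = mu * np.eye(ell)
    M_diff = np.zeros(ell)
    for label, sign in ((-1, -1.0), (+1, +1.0)):
        idx = np.where(y == label)[0]
        K_k = K[:, idx]                    # ell x ell_k block of kernel columns
        ell_k = len(idx)
        N += K_k @ (np.eye(ell_k) - np.full((ell_k, ell_k), 1.0 / ell_k)) @ K_k.T
        M_diff += sign * K_k.mean(axis=1)  # class mean in feature space, alpha-coordinates
    return np.linalg.solve(N, M_diff)      # direction maximizing the kernelized J

def kfd_project(alpha, X_train, X_test, gamma=1.0):
    """Projection f(x) = sum_i alpha_i k(x_i, x) onto the discriminant direction."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# Toy usage: projections of the two classes separate along the learned direction.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
alpha = kernel_fisher_discriminant(X, y)
scores = kfd_project(alpha, X, X)
print(scores[y == -1].mean(), scores[y == +1].mean())
```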


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kfd', 0.772), ('fisher', 0.246), ('discriminant', 0.227), ('kernel', 0.162), ('potsdam', 0.14), ('irn', 0.135), ('mika', 0.101), ('mk', 0.09), ('formulation', 0.09), ('mercer', 0.082), ('success', 0.079), ('svms', 0.066), ('gunnar', 0.06), ('ingredients', 0.06), ('noteworthy', 0.06), ('pace', 0.06), ('reformulate', 0.06), ('nonlinear', 0.058), ('yielding', 0.058), ('exhibiting', 0.055), ('ili', 0.055), ('ratsch', 0.055), ('surfaces', 0.055), ('sw', 0.055), ('programming', 0.051), ('good', 0.051), ('enormous', 0.051), ('sb', 0.051), ('explicit', 0.049), ('margin', 0.049), ('sparse', 0.049), ('understanding', 0.048), ('kij', 0.048), ('klaus', 0.048), ('straight', 0.048), ('triggered', 0.048), ('permits', 0.045), ('article', 0.045), ('kkt', 0.045), ('mt', 0.045), ('supporting', 0.045), ('unifying', 0.045), ('exists', 0.044), ('germany', 0.043), ('ill', 0.043), ('lk', 0.043), ('sebastian', 0.043), ('support', 0.042), ('mathematical', 0.041), ('algebra', 0.041), ('sparseness', 0.041), ('class', 0.04), ('xi', 0.04), ('outline', 0.039), ('modification', 0.039), ('regarded', 0.039), ('makes', 0.039), ('explanation', 0.038), ('miiller', 0.038), ('algorithmic', 0.038), ('ir', 0.038), ('primarily', 0.038), ('usefulness', 0.038), ('berlin', 0.036), ('thereby', 0.036), ('vector', 0.036), ('machines', 0.035), ('solve', 0.035), ('purpose', 0.035), ('optimizing', 0.035), ('interestingly', 0.035), ('labels', 0.035), ('responsible', 0.035), ('relevance', 0.034), ('generalization', 0.033), ('samples', 0.033), ('coefficient', 0.033), ('risk', 0.033), ('training', 0.032), ('interesting', 0.032), ('fix', 0.032), ('incorporate', 0.032), ('investigate', 0.032), ('replaced', 0.032), ('understood', 0.032), ('algorithms', 0.032), ('forward', 0.031), ('furthermore', 0.031), ('upon', 0.03), ('sample', 0.029), ('mapped', 0.029), ('structural', 0.029), ('concept', 0.028), ('years', 0.028), ('existence', 0.027), ('expansion', 0.027), ('interpretation', 0.027), ('ideas', 0.027), ('background', 0.027), ('maximizing', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller

Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.

2 0.11225897 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

3 0.097336754 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

4 0.096298814 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

5 0.093029842 54 nips-2000-Feature Selection for SVMs

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

6 0.090118736 121 nips-2000-Sparse Kernel Principal Component Analysis

7 0.081120625 75 nips-2000-Large Scale Bayes Point Machines

8 0.07619258 110 nips-2000-Regularization with Dot-Product Kernels

9 0.074103326 133 nips-2000-The Kernel Gibbs Sampler

10 0.07082954 74 nips-2000-Kernel Expansions with Unlabeled Examples

11 0.068000168 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

12 0.064109422 4 nips-2000-A Linear Programming Approach to Novelty Detection

13 0.062123429 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

14 0.060338527 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

15 0.059364311 120 nips-2000-Sparse Greedy Gaussian Process Regression

16 0.054919891 44 nips-2000-Efficient Learning of Linear Perceptrons

17 0.051338103 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

18 0.050859563 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

19 0.048484329 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

20 0.047598794 122 nips-2000-Sparse Representation for Gaussian Process Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.15), (1, 0.153), (2, -0.072), (3, 0.054), (4, -0.062), (5, 0.104), (6, -0.037), (7, 0.023), (8, -0.025), (9, 0.012), (10, 0.005), (11, -0.059), (12, -0.056), (13, -0.009), (14, 0.074), (15, 0.0), (16, 0.084), (17, -0.007), (18, 0.01), (19, -0.017), (20, -0.061), (21, -0.038), (22, 0.022), (23, 0.067), (24, 0.003), (25, -0.08), (26, 0.045), (27, -0.039), (28, -0.076), (29, -0.064), (30, -0.013), (31, 0.046), (32, -0.081), (33, -0.078), (34, 0.031), (35, -0.089), (36, 0.104), (37, -0.018), (38, -0.054), (39, 0.022), (40, 0.067), (41, 0.016), (42, -0.09), (43, 0.032), (44, -0.039), (45, -0.038), (46, 0.191), (47, -0.118), (48, -0.02), (49, 0.165)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93603814 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller

Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.

2 0.5893662 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

3 0.58794832 110 nips-2000-Regularization with Dot-Product Kernels

Author: Alex J. Smola, Zoltán L. Óvári, Robert C. Williamson

Abstract: In this paper we give necessary and sufficient conditions under which kernels of dot product type k(x, y) = k(x . y) satisfy Mercer's condition and thus may be used in Support Vector Machines (SVM), Regularization Networks (RN) or Gaussian Processes (GP). In particular, we show that if the kernel is analytic (i.e. can be expanded in a Taylor series), all expansion coefficients have to be nonnegative. We give an explicit functional form for the feature map by calculating its eigenfunctions and eigenvalues. 1

4 0.5190475 54 nips-2000-Feature Selection for SVMs

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

5 0.51783532 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

6 0.40331569 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

7 0.38338679 52 nips-2000-Fast Training of Support Vector Classifiers

8 0.37714538 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

9 0.37183523 74 nips-2000-Kernel Expansions with Unlabeled Examples

10 0.369111 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

11 0.36419085 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

12 0.3600961 121 nips-2000-Sparse Kernel Principal Component Analysis

13 0.33236951 58 nips-2000-From Margin to Sparsity

14 0.31267095 44 nips-2000-Efficient Learning of Linear Perceptrons

15 0.30343124 116 nips-2000-Sex with Support Vector Machines

16 0.30188471 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

17 0.29432273 120 nips-2000-Sparse Greedy Gaussian Process Regression

18 0.28538135 75 nips-2000-Large Scale Bayes Point Machines

19 0.27095306 133 nips-2000-The Kernel Gibbs Sampler

20 0.26753679 12 nips-2000-A Support Vector Method for Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.05), (17, 0.172), (33, 0.06), (54, 0.021), (55, 0.011), (56, 0.329), (62, 0.017), (65, 0.03), (67, 0.062), (76, 0.067), (81, 0.014), (90, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82785404 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller

Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both KFD and the proposed sparse KFD can be understood in a unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.

2 0.79920805 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian

Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncracies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures; each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, y = ex and y = x 9, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. 
We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters e assigns the likelihood f(yle) to a given set of observed data y . The full form of the MDL measure for such a model family is given below. MDL = -In! (yISA) + ~ln( ; )+ In f dS.jdetl(S) where SA is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size and I(e) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log likelihood term is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(e). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2] . Hence in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. 2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which

3 0.51917744 4 nips-2000-A Linear Programming Approach to Novelty Detection

Author: Colin Campbell, Kristin P. Bennett

Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data i. e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish b etween new instances similar to m embers of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are X i (with i = 1, . . . , m) and the mapping is Xi --+ ¢;(Xi) then in the span of {¢;(Xi)}, we can expand a vector w = Lj cr.j¢;(Xj). Hence we can define separating hyperplanes in feature space by w . ¢;(x;) + b = O. We will refer to w . ¢;(Xi) + b as the margin which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: (1) where z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function: (2) A number of choices for the kernel are possible, for example, RBF kernels: (3) With the given kernel the decision function is therefore given by: (4) One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3 , 12] . This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11] . The effect of outliers is reduced by using slack variables to allow for datapoints outside the sphere and the task is to minimise the volume of the sphere and number of datapoints outside i.e. e i mIll s.t. [R2 + oX L i ei 1 (Xi - a) . (Xi - a) S R2 + e ei i, ~ a (5) Since the data appears in the form of inner products kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7]. 
Suppose we restricted our attention to RBF kernels (3) then the data lies on the surface of a hypersphere in feature space since ¢;(x) . ¢;(x) = K(x , x) = l. The objective is therefore to separate off the surface region constaining data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of: mIll s.t. W(cr.) = t L7,'k=l cr.icr.jK(Xi, Xj) a S cr.i S C, L::1 cr.i = l. (6) However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, J(z) = 0, of some nonlinear function. In feature space, J(z) = L; O'.;K(z, x;) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Li J(x;). This is achieved by minimising: (7) subject to: m LO'.jK(x;,Xj) + b 2:: 0 (8) j=l m L 0'.; = 1, 0'.; 2:: 0 (9) ;=1 The bias b is just treated as an additional parameter in the minimisation process though unrestricted in sign. The added constraints (9) on 0'. bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with 0'. i- 0 exists. Other constraints on the class of functions are possible, e.g. 110'.111 = 1 with no restriction on the sign of O'.i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise: (10) subject to: m LO:jJ{(Xi , Xj)+b~-ei' ei~O (11) j=l and constraints (9). The parameter). controls the extent of margin errors (larger ). means fewer outliers are ignored: ). -+ 00 corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. 
Column generation algorithms incrementally add and drop columns each corresponding to a single kernel function until optimality is reached. Such approaches have been successfully applied to other support vector problems [6 , 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and (J

4 0.51839656 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

5 0.51833075 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

6 0.51568282 130 nips-2000-Text Classification using String Kernels

7 0.51501286 122 nips-2000-Sparse Representation for Gaussian Process Models

8 0.51430243 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

9 0.51108783 111 nips-2000-Regularized Winnow Methods

10 0.51020634 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

11 0.50866961 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

12 0.50794005 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

13 0.50775158 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

14 0.50724804 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

15 0.50422436 133 nips-2000-The Kernel Gibbs Sampler

16 0.50398916 36 nips-2000-Constrained Independent Component Analysis

17 0.50177288 21 nips-2000-Algorithmic Stability and Generalization Performance

18 0.49604899 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

19 0.49603081 79 nips-2000-Learning Segmentation by Random Walks

20 0.49425948 51 nips-2000-Factored Semi-Tied Covariance Matrices