nips nips2000 nips2000-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modelling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate the onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution, i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are x_i (with i = 1, ..., m) and the mapping is x_i → φ(x_i); then, in the span of {φ(x_i)}, we can expand a vector w = Σ_j α_j φ(x_j). Hence we can define separating hyperplanes in feature space by w · φ(x) + b = 0. We will refer to w · φ(x_i) + b as the margin, which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function f(z) = sign(w · φ(z) + b) (1), where z is a new data point. The data appears in the form of an inner product in feature space, so we can implicitly define feature space by our choice of kernel function K(x_i, x_j) = φ(x_i) · φ(x_j) (2). A number of choices for the kernel are possible, for example RBF kernels K(x_i, x_j) = exp(−||x_i − x_j||²/2σ²) (3). With the given kernel the decision function is therefore given by f(z) = sign(Σ_i α_i K(z, x_i) + b) (4). One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3, 12]. This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real-life applications [11]. The effect of outliers is reduced by using slack variables ξ_i to allow for datapoints outside the sphere, and the task is to minimise the volume of the sphere and the number of datapoints outside, i.e. min_{R, a, ξ} [R² + λ Σ_i ξ_i] s.t. (x_i − a) · (x_i − a) ≤ R² + ξ_i, ξ_i ≥ 0 (5). Since the data appears in the form of inner products, kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7]. 
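As a concrete illustration of the kernel expansion and decision function above, the following is a minimal sketch (not code from the paper; the helper names rbf_kernel and decision, and the use of numpy, are our own assumptions for illustration only):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # RBF kernel (3): K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def decision(Z, X_train, alpha, b, sigma=1.0):
    # Decision function (4): sign(sum_i alpha_i K(z, x_i) + b) for each row z of Z
    K = rbf_kernel(np.atleast_2d(Z), X_train, sigma)
    return np.sign(K @ alpha + b)
```

Given coefficients α and a bias b obtained from some training procedure, this is all that is needed to label a new point as lying inside (+1) or outside (−1) the modelled support.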
Suppose we restrict our attention to RBF kernels (3); then the data lies on the surface of a hypersphere in feature space since φ(x) · φ(x) = K(x, x) = 1. The objective is therefore to separate off the surface region containing data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin, with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of W(α) = ½ Σ_{i,j=1}^m α_i α_j K(x_i, x_j) s.t. 0 ≤ α_i ≤ C, Σ_{i=1}^m α_i = 1 (6). However, the origin plays a special role in this model. As the authors point out [9], this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution, we instead try to attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3, it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm. For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, J(z) = 0, of some nonlinear function. In feature space, J(z) = Σ_i α_i K(z, x_i) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Σ_i J(x_i). This is achieved by minimising W(α, b) = Σ_{i=1}^m (Σ_{j=1}^m α_j K(x_i, x_j) + b) (7) subject to Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ 0 (8) and Σ_{i=1}^m α_i = 1, α_i ≥ 0 (9). The bias b is just treated as an additional parameter in the minimisation process, though unrestricted in sign. The added constraints (9) on α bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with α ≠ 0 exists. Other constraints on the class of functions are possible, e.g. ||α||_1 = 1 with no restriction on the sign of α_i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin, in analogy to the usual approach used with support vector machines. In this case we minimise W(α, b) = Σ_{i=1}^m (Σ_{j=1}^m α_j K(x_i, x_j) + b) + λ Σ_{i=1}^m ξ_i (10) subject to Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ −ξ_i, ξ_i ≥ 0 (11) and constraints (9). The parameter λ controls the extent of margin errors (larger λ means fewer outliers are ignored: λ → ∞ corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. 
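Since (9)-(11) define a plain linear program in (α, b, ξ), training can be handed to any off-the-shelf LP solver. The following is a minimal sketch assuming scipy.optimize.linprog as the solver and the reconstructed objective (10) above; the function name lp_novelty_detector and the variable packing are illustrative choices, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog

def lp_novelty_detector(K, lam=10.0):
    """Soft-margin LP novelty detector: a sketch of (9)-(11) with the reconstructed objective (10).

    K   : (m, m) kernel matrix on the training data.
    lam : soft-margin parameter lambda (larger lam tolerates fewer outliers;
          in this sketch lam > 1 keeps the problem bounded in b).
    Returns (alpha, b).
    """
    m = K.shape[0]
    # Stack the variables as x = [alpha_1..alpha_m, b, xi_1..xi_m].
    # Objective (10): sum_i (K alpha)_i + m*b + lam * sum_i xi_i.
    c = np.concatenate([K.sum(axis=0), [float(m)], lam * np.ones(m)])
    # Margin constraints (11): -(K alpha)_i - b - xi_i <= 0 for every i.
    A_ub = np.hstack([-K, -np.ones((m, 1)), -np.eye(m)])
    b_ub = np.zeros(m)
    # Normalisation constraint (9): sum_i alpha_i = 1.
    A_eq = np.concatenate([np.ones(m), [0.0], np.zeros(m)])[None, :]
    b_eq = np.array([1.0])
    # alpha >= 0 and xi >= 0, while b is free in sign.
    bounds = [(0, None)] * m + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    alpha, b = res.x[:m], res.x[m]
    return alpha, b
```

Taking λ very large effectively recovers the hard-margin problem (7)-(9), since any positive slack then dominates the objective.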
Column generation algorithms incrementally add and drop columns, each corresponding to a single kernel function, until optimality is reached. Such approaches have been successfully applied to other support vector problems [6, 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments. Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and σ
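As a small end-to-end usage sketch in the spirit of the two-cluster experiment of Figure 1, reusing the illustrative rbf_kernel, lp_novelty_detector and decision helpers above (the cluster centres, σ and λ values below are arbitrary choices, not the paper's settings):

```python
import numpy as np

# Two small Gaussian clusters standing in for the artificial data of Figure 1;
# the cluster centres, sigma and lam below are arbitrary illustrative values.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1.0, 0.0], 0.2, size=(40, 2)),
               rng.normal([1.0, 0.0], 0.2, size=(40, 2))])

sigma = 0.5
K = rbf_kernel(X, X, sigma)                  # kernel matrix on the training points
alpha, b = lp_novelty_detector(K, lam=10.0)  # fit the soft-margin LP detector

# One test point inside a cluster and one far outside it.
Z = np.array([[-1.0, 0.05], [3.0, 3.0]])
print(decision(Z, X, alpha, b, sigma))       # +1 marks the normal region, -1 a novel point
```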
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. [sent-8, score-0.751]
2 It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. [sent-9, score-0.813]
3 One approach is to build a hypothesis estimating the support of the normal data i. [sent-10, score-0.389]
4 constructing a function which is positive in the region where the data is located and negative elsewhere. [sent-12, score-0.136]
5 Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. [sent-13, score-0.387]
6 In this paper we propose a simpler kernel method for estimating the support based on linear programming. [sent-14, score-0.291]
7 The method is easy to implement and can learn large datasets rapidly. [sent-15, score-0.07]
8 We demonstrate the method on medical and fault detection datasets. [sent-16, score-0.463]
9 An important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. [sent-18, score-0.557]
10 For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. [sent-19, score-0.437]
11 For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. [sent-21, score-0.59]
12 Novel events can be highlighted by constructing a real-valued density estimation function. [sent-22, score-0.083]
13 However, here we will consider the simpler task of modelling the support of a data distribution i. [sent-23, score-0.164]
14 creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. [sent-25, score-0.169]
15 Recently kernel methods have been applied to this problem [4]. [sent-26, score-0.171]
16 In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. [sent-27, score-0.252]
17 Suppose the data points in input space are X i (with i = 1, . [sent-28, score-0.133]
18 Hence we can define separating hyperplanes in feature space by w . [sent-33, score-0.162]
19 φ(x_i) + b as the margin which will be positive on one side of the separating hyperplane and negative on the other. [sent-36, score-0.354]
20 Thus we can also define a decision function: (1) where z is a new data point. [sent-37, score-0.054]
21 This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11] . [sent-39, score-0.46]
22 The effect of outliers is reduced by using slack variables to allow for datapoints outside the sphere and the task is to minimise the volume of the sphere and number of datapoints outside i. [sent-40, score-0.653]
23 (x_i − a) · (x_i − a) ≤ R² + ξ_i, ξ_i ≥ 0 (5). Since the data appears in the form of inner products kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. [sent-45, score-0.402]
24 Suppose we restricted our attention to RBF kernels (3) then the data lies on the surface of a hypersphere in feature space since φ(x) · φ(x) = K(x, x) = 1. [sent-48, score-0.555]
25 The objective is therefore to separate off the surface region containing data from the region containing no data. [sent-50, score-0.21]
26 This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. [sent-51, score-0.684]
27 The learning task in dual form involves minimisation of: min s. [sent-52, score-0.078]
28 (6) However, the origin plays a special role in this model. [sent-60, score-0.087]
29 As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. [sent-61, score-0.709]
30 In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. [sent-62, score-0.581]
31 In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. [sent-63, score-0.523]
32 As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. [sent-64, score-1.024]
33 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. [sent-65, score-0.633]
34 This surface is defined as the level set, J(z) = 0, of some nonlinear function. [sent-66, score-0.09]
35 In feature space, J(z) = Σ_i α_i K(z, x_i) + b; this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. [sent-68, score-0.457]
36 We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i. [sent-69, score-0.169]
37 α_i ≥ 0 (9). The bias b is just treated as an additional parameter in the minimisation process, though unrestricted in sign. [sent-75, score-0.045]
38 These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. [sent-78, score-0.623]
39 Also, these constraints ensure that the problem is well-posed and that an optimal solution with α ≠ 0 exists. [sent-79, score-0.049]
40 Other constraints on the class of functions are possible, e. [sent-81, score-0.049]
41 To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. [sent-87, score-0.256]
42 In this case we minimise: (10) subject to: Σ_{j=1}^m α_j K(x_i, x_j) + b ≥ −ξ_i, ξ_i ≥ 0 (11) and constraints (9). [sent-88, score-0.049]
43 The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. [sent-93, score-0.05]
44 With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. [sent-94, score-0.312]
45 Column generation algorithms incrementally add and drop columns each corresponding to a single kernel function until optimality is reached. [sent-95, score-0.253]
46 Such approaches have been successfully applied to other support vector problems [6 , 2]. [sent-96, score-0.077]
47 Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. [sent-97, score-0.261]
48 Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. [sent-99, score-0.091]
49 In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and σ = 0. [sent-100, score-0.477]
50 In Figure 2 four outliers lying outside a single cluster are ignored when the system is trained using a soft margin. [sent-102, score-0.376]
51 In Figure 3 we show the effect of using a modified RBF kernel K(x_i, x_j) = exp(−|x_i − x_j|/2σ²). [sent-103, score-0.171]
52 This kernel and the one in (3) depend only on x − y, thus K(x, x) is constant and the points lie on the surface of a hypersphere in feature space. [sent-104, score-0.531]
53 As a consequence a hyperplane slicing through this hypersphere gives a closed boundary separating normal and abnormal in input space: however, we found other choices of kernels may not produce closed boundaries in input space. [sent-105, score-1.286]
54 Figure 1: The solution in input space for the hyperplane minimising W(α, b), equation (7). [sent-110, score-0.295]
55 A hard margin was used with RBF kernels trained using σ = 0. [sent-111, score-0.283]
56 For detection of abnormalities in medical data we investigated performance on the Biomed dataset [5] from the Statlib data archive [14]. [sent-113, score-0.653]
57 Figure 2: In this example 4 outliers are ignored by using a soft margin (with λ = 10. [sent-114, score-0.331]
58 Figure 3: The solution in input space for a modified RBF kernel K(x_i, x_j) = exp(−|x_i − x_j|/2σ²) with σ = 0.5. [sent-120, score-0.25]
59 This dataset consisted of 194 observations each with 4 attributes corresponding to measurements made on blood samples (15 observations with missing values were removed). [sent-121, score-0.435]
60 We trained the system on 100 randomly chosen normal observations from healthy patients. [sent-122, score-0.367]
61 The system was then tested on 27 normal observations and 67 observations which exhibited abnormalities due to the presence of a rare genetic disease. [sent-123, score-0.637]
62 In Figure 4 we plot the results for training the novelty detector using a hard margin and with RBF kernels. [sent-124, score-0.499]
63 This plot gives the error rate (as a percentage) on the y-axis, versus σ on the x-axis, with the solid curve giving the performance on normal observations in the test data and the dashed curve giving performance on abnormal observations. [sent-125, score-1.038]
64 Clearly, when σ is very small the system puts a Gaussian of narrow width around each data point and hence all test data is labelled as abnormal. [sent-126, score-0.376]
65 1 all but 2 of the normal test observations are correctly labelled and 57 of the 67 abnormal observations are correctly labelled. [sent-128, score-1.165]
66 0 the solution has 1 normal test observation incorrectly labelled and 29 abnormal observations correctly labelled. [sent-130, score-0.955]
67 The kernel parameter σ is therefore crucial in determining the balance between ... Figure 4: The error rate (as a percentage) on the y-axis, versus σ on the x-axis. [sent-131, score-0.171]
68 The solid curve gives the performance on normal observations in the test data and the dashed curve gives performance on abnormal observations. [sent-132, score-1.038]
69 Future research on model selection may indicate a good choice for the kernel parameter. [sent-134, score-0.171]
70 However, if the dataset is large enough and some abnormal events are known then a validation study can be used to determine the kernel parameter - as we illustrate with the application below. [sent-135, score-0.709]
71 Fault detection is an important generic problem in the condition monitoring of machinery: failure to detect faults can lead to machine damage, while an oversensitive fault detection system can lead to expensive and unnecessary downtime. [sent-138, score-0.807]
72 As an example we will consider detection of 4 classes of fault in ball-bearing cages, which are often safety critical components in machines, vehicles and other systems such as aircraft wing flaps. [sent-139, score-0.35]
73 In this study we used a dataset from the Structural Integrity and Damage Assessment Network [15] . [sent-140, score-0.068]
74 Each instance consisted of 2048 samples of acceleration taken with a Bruel and Kjaer vibration analyser. [sent-141, score-0.112]
75 After pre-processing with a discrete Fast Fourier Transform each such instance had 32 attributes characterising the measured signals. [sent-142, score-0.036]
76 To train the system we used 913 normal instances on new ball-bearings. [sent-144, score-0.509]
77 Using RBF kernels the best value of σ (σ = 320. [sent-145, score-0.091]
78 0) was found using a validation study consisting of 913 new normal instances, 747 instances of type 1 faults and 996 instances of type 2 faults. [sent-146, score-1.06]
79 7% of normal instances were correctly labelled (913 instances), 100% of type 1 instances were correctly labelled (747 instances) and 53. [sent-148, score-1.287]
80 3% of type 2 instances were correctly labelled (996 instances). [sent-149, score-0.589]
81 Of course, with ample normal and abnormal data this problem could also be approached using a binary classifier instead. [sent-150, score-0.629]
82 Thus to evaluate performance on totally unseen abnormalities we tested the novelty detector on type 3 errors and type 4 errors (with 996 instances of each). [sent-151, score-1.025]
83 5% of type 4 instances as abnormal - which was statistically significant against a background of 1. [sent-154, score-0.262]
84 4 Conclusion In this paper we have presented a new kernelised novelty detection algorithm which uses linear programming techniques rather than quadratic programming. [sent-156, score-0.556]
85 The algorithm is also very fast in execution: for the 913 training examples used in the experiments on condition monitoring the model was constructed in about 4 seconds using a Silicon Graphics Origin 200. [sent-158, score-0.096]
86 A tutorial on support vector machines for pattern recognition. [sent-174, score-0.077]
87 Support vector data description applied to machine vibration analysis. [sent-235, score-0.108]
wordName wordTfidf (topN-words)
[('abnormal', 0.36), ('instances', 0.262), ('novelty', 0.246), ('normal', 0.215), ('detection', 0.214), ('rbf', 0.183), ('hypersphere', 0.174), ('kernel', 0.171), ('hyperplane', 0.169), ('abnormalities', 0.15), ('fault', 0.136), ('margin', 0.133), ('labelled', 0.131), ('damage', 0.126), ('observations', 0.12), ('medical', 0.113), ('datapoints', 0.109), ('type', 0.106), ('bennett', 0.104), ('tax', 0.104), ('outside', 0.096), ('kernels', 0.091), ('correctly', 0.09), ('surface', 0.09), ('origin', 0.087), ('outliers', 0.085), ('generation', 0.082), ('bristol', 0.081), ('scholkopf', 0.078), ('support', 0.077), ('datasets', 0.07), ('cage', 0.07), ('campbell', 0.07), ('faults', 0.07), ('dataset', 0.068), ('ignored', 0.067), ('xi', 0.066), ('giving', 0.066), ('programming', 0.063), ('feature', 0.063), ('detector', 0.061), ('mill', 0.06), ('monitoring', 0.06), ('column', 0.059), ('hard', 0.059), ('curve', 0.059), ('consisted', 0.058), ('normality', 0.054), ('vibration', 0.054), ('data', 0.054), ('xj', 0.053), ('separating', 0.052), ('graphics', 0.05), ('loose', 0.05), ('lying', 0.05), ('simplex', 0.05), ('constructing', 0.049), ('constraints', 0.049), ('ei', 0.048), ('errors', 0.047), ('minimise', 0.047), ('minimising', 0.047), ('space', 0.047), ('soft', 0.046), ('restriction', 0.046), ('broken', 0.045), ('generic', 0.045), ('minimisation', 0.045), ('united', 0.045), ('estimating', 0.043), ('behaviour', 0.042), ('closed', 0.042), ('mining', 0.042), ('boundary', 0.041), ('implicitly', 0.041), ('percentage', 0.041), ('lp', 0.041), ('platt', 0.041), ('validation', 0.039), ('centre', 0.039), ('microsoft', 0.039), ('sphere', 0.039), ('test', 0.039), ('illustrate', 0.037), ('condition', 0.036), ('attributes', 0.036), ('choices', 0.036), ('lies', 0.036), ('clusters', 0.035), ('hence', 0.034), ('events', 0.034), ('region', 0.033), ('task', 0.033), ('lie', 0.033), ('smola', 0.033), ('measurements', 0.033), ('quadratic', 0.033), ('system', 0.032), ('input', 0.032), ('around', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modelling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets.
2 0.33025199 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
Author: Paul M. Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis
Abstract: A system has been developed to extract diagnostic information from jet engine carcass vibration data. Support Vector Machines applied to novelty detection provide a measure of how unusual the shape of a vibration signature is, by learning a representation of normality. We describe a novel method for Support Vector Machines of including information from a second class for novelty detection and give results from the application to Jet Engine vibration analysis.
3 0.14579666 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.13979363 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
5 0.12256552 54 nips-2000-Feature Selection for SVMs
Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik
Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.
6 0.11803808 121 nips-2000-Sparse Kernel Principal Component Analysis
7 0.11579547 12 nips-2000-A Support Vector Method for Clustering
8 0.11506929 58 nips-2000-From Margin to Sparsity
9 0.11464895 37 nips-2000-Convergence of Large Margin Separable Linear Classification
10 0.11373305 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
11 0.11048046 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
12 0.10523777 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
13 0.10373563 110 nips-2000-Regularization with Dot-Product Kernels
14 0.10331471 133 nips-2000-The Kernel Gibbs Sampler
15 0.099167928 75 nips-2000-Large Scale Bayes Point Machines
16 0.098643243 18 nips-2000-Active Support Vector Machine Classification
17 0.094654731 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
18 0.092823751 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
19 0.089938648 44 nips-2000-Efficient Learning of Linear Perceptrons
20 0.082701102 74 nips-2000-Kernel Expansions with Unlabeled Examples
topicId topicWeight
[(0, 0.265), (1, 0.2), (2, -0.071), (3, 0.041), (4, -0.137), (5, 0.172), (6, 0.008), (7, 0.019), (8, -0.019), (9, 0.24), (10, 0.001), (11, 0.045), (12, -0.08), (13, -0.033), (14, 0.254), (15, -0.134), (16, -0.15), (17, -0.06), (18, -0.185), (19, -0.138), (20, 0.043), (21, -0.091), (22, 0.114), (23, -0.184), (24, -0.122), (25, 0.07), (26, -0.105), (27, 0.042), (28, 0.139), (29, 0.105), (30, -0.153), (31, -0.049), (32, 0.091), (33, 0.066), (34, -0.073), (35, -0.014), (36, -0.017), (37, 0.084), (38, 0.071), (39, 0.002), (40, -0.031), (41, -0.112), (42, 0.024), (43, -0.014), (44, 0.016), (45, -0.035), (46, -0.08), (47, -0.017), (48, -0.072), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.95716053 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modelling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets.
2 0.93299633 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
Author: Paul M. Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis
Abstract: A system has been developed to extract diagnostic information from jet engine carcass vibration data. Support Vector Machines applied to novelty detection provide a measure of how unusual the shape of a vibration signature is, by learning a representation of normality. We describe a novel method for Support Vector Machines of including information from a second class for novelty detection and give results from the application to Jet Engine vibration analysis.
3 0.42034057 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.40790007 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
5 0.40651029 12 nips-2000-A Support Vector Method for Clustering
Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik
Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.
6 0.39866209 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
7 0.3880102 54 nips-2000-Feature Selection for SVMs
8 0.38085717 18 nips-2000-Active Support Vector Machine Classification
9 0.36019492 43 nips-2000-Dopamine Bonuses
10 0.35858449 44 nips-2000-Efficient Learning of Linear Perceptrons
11 0.35267079 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
12 0.35188204 148 nips-2000-`N-Body' Problems in Statistical Learning
13 0.32520318 110 nips-2000-Regularization with Dot-Product Kernels
14 0.30602324 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
15 0.29717976 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
16 0.29611447 133 nips-2000-The Kernel Gibbs Sampler
17 0.29311016 58 nips-2000-From Margin to Sparsity
18 0.29059467 121 nips-2000-Sparse Kernel Principal Component Analysis
19 0.28462902 37 nips-2000-Convergence of Large Margin Separable Linear Classification
20 0.27652287 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
topicId topicWeight
[(10, 0.039), (17, 0.166), (32, 0.012), (33, 0.06), (55, 0.019), (58, 0.257), (62, 0.034), (65, 0.02), (67, 0.055), (75, 0.021), (76, 0.06), (79, 0.023), (81, 0.028), (90, 0.112), (91, 0.011), (97, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.84430546 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modelling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets.
2 0.78189611 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and report results on simulated environments.
3 0.64355791 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
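To make the idea of a kernel expansion concrete, the sketch below builds the expanded representation (each point described by its normalised kernel values to all labeled and unlabeled examples) and then fits an ordinary discriminative classifier on the labeled points. It only illustrates the representation: the EM training of the expansion coefficients described in the abstract is not reproduced, and the function names, the kernel width and the use of scikit-learn's LogisticRegression are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def kernel_expansion(X, X_all, sigma=1.0):
    # Represent each point by its (normalised) RBF kernel values to every
    # labeled and unlabeled example; this is the expanded feature vector.
    d2 = ((X[:, None, :] - X_all[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)

def fit_expanded_classifier(X_lab, y_lab, X_unl, sigma=1.0):
    # X_lab, y_lab: the few labeled points; X_unl: the many unlabeled points.
    X_all = np.vstack([X_lab, X_unl])
    Phi = kernel_expansion(X_lab, X_all, sigma)
    # A plain discriminative fit on the expanded features; the paper instead
    # trains the expansion coefficients with EM over the unlabeled points.
    clf = LogisticRegression(max_iter=1000).fit(Phi, y_lab)
    return clf, X_all
```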
4 0.63929462 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
Author: Claudio Gentile
Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane with respect to norm $p \ge 2$ for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes $O\!\left(\frac{(p-1)\,X^2}{\alpha^2 \gamma^2}\right)$ corrections to separate the data with p-norm margin larger than $(1-\alpha)\gamma$, where $\gamma$ is the p-norm margin of the data and $X$ is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite a bit better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is much faster and easier to implement than standard SVM training algorithms.
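The flavour of such an approximate large-margin perceptron can be conveyed by the simplified p = 2 sketch below: instances are normalised, an update is made whenever the current margin falls below a shrinking threshold, and the weight vector is projected back into the unit ball. The schedules and constants here are illustrative assumptions and do not reproduce Gentile's exact ALMAp prescription.

```python
import numpy as np

def alma2_like(X, y, alpha=0.5, epochs=20, B=None, C=np.sqrt(2.0)):
    """Simplified ALMA_2-style margin perceptron (p = 2 only). The decaying
    margin threshold and step size follow the general recipe sketched in the
    abstract; the exact constants of ALMA_p are not reproduced here."""
    B = (1.0 / alpha) if B is None else B
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm instances
    w, k = np.zeros(X.shape[1]), 1
    for _ in range(epochs):
        for xi, yi in zip(Xn, y):
            gamma_k = B / np.sqrt(k)            # current margin threshold
            if yi * (w @ xi) <= (1.0 - alpha) * gamma_k:
                eta_k = C / np.sqrt(k)          # current step size
                w = w + eta_k * yi * xi
                w /= max(1.0, np.linalg.norm(w))  # project back into the unit ball
                k += 1
    return w
```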
5 0.63312107 52 nips-2000-Fast Training of Support Vector Classifiers
Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez
Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVC. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and the sample selection strategy, is shown by means of computer experiments using well-known data sets.

1 INTRODUCTION

The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP), which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], which are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and on the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must join the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and hence the runtime complexity. To address this issue, the computer program svcradit has been developed, which solves the SVC for large training data sets using the IRWLS procedure and fixed-size working sets.

The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for the SVC and explain how it can be incorporated into a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation, which are compared with the fastest available SVC implementation, SVMlight [6]. We end with some concluding remarks.

2 IRWLS-SVC

In order to solve classification problems, the SVC has to minimize

$$L_p = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \mu_i \xi_i - \sum_i \alpha_i\big(y_i(\phi(x_i)^{\top} w + b) - 1 + \xi_i\big) \qquad (1)$$

with respect to $w$, $b$ and $\xi_i$, and maximize it with respect to $\alpha_i$ and $\mu_i$, subject to $\alpha_i, \mu_i \ge 0$, where $\phi(\cdot)$ is a nonlinear transformation (usually unknown) to a higher dimensional space and $C$ is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work of Vapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on $\xi_i$ can be removed because, at the solution, $C - \alpha_i - \mu_i = 0 \;\forall i$ (one of the KKT conditions [2]) must hold.
$$L_p = \tfrac{1}{2}\|w\|^2 + \sum_i \alpha_i\big(1 - y_i(\phi^{\top}(x_i) w + b)\big) \qquad (2)$$

The weighted least squares nature of (2) can be understood if $e_i$ is defined as the error on each sample and $a_i$ as its associated weight, with $\tfrac{1}{2}\|w\|^2$ acting as a regularizing functional. The minimization of (2) cannot be accomplished in a single step because $a_i = a_i(e_i)$, and we need to apply an IRWLS procedure [4], summarized below in three steps:

1. Considering the $a_i$ fixed, minimize (2).
2. Recalculate $a_i$ from the solution of step 1.
3. Repeat until convergence.

In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that $w = \sum_i \beta_i y_i \phi(x_i)$ and, in order to obtain a non-zero $b$, that $\sum_i \beta_i y_i = 0$. Substituting these into (2), its minimum with respect to $\beta_i$ and $b$ for a fixed set of $a_i$ is found by solving a linear equation system (3) in $\beta$ and $b$ (the detailed description of the steps needed to obtain (3) from (2) can be found in [10]), where

$$y = [y_1, y_2, \ldots, y_n]^{\top} \qquad (4)$$
$$(H)_{ij} = y_i y_j \phi^{\top}(x_i)\phi(x_j) = y_i y_j K(x_i, x_j), \quad \forall i, j = 1, \ldots, n \qquad (5)$$
$$(D_a)_{ij} = a_i\,\delta[i - j], \quad \forall i, j = 1, \ldots, n \qquad (6)$$
$$\beta = [\beta_1, \beta_2, \ldots, \beta_n]^{\top} \qquad (7)$$

and $\delta[\cdot]$ is the discrete impulse function. Finally, the dependency of $a_i$ upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining the weight update rule (8), which assigns $a_i = 0$ to samples with $e_i y_i < 0$ and a non-zero weight otherwise.

2.1 IRWLS ALGORITHMIC IMPLEMENTATION

The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, $S_1$, contains the training samples verifying $0 < \beta_i < C$, which have to be determined by solving (3). The second one, $S_2$, includes every training sample whose $\beta_i = 0$. And the last one, $S_3$, is made up of the training samples whose $\beta_i = C$. This division into sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1.

Table 1: IRWLS-SVC algorithm.

0. Initialization: $S_1$ contains every training sample, $S_2 = \emptyset$ and $S_3 = \emptyset$. Compute $H$. Set $e_a = y$, $\beta_a = 0$, $b_a = 0$, $G_{13} = G_{in}$, $a = 1$ (all sample weights equal to one) and $G_{b3} = G_{bin}$.
1. With the samples in $S_1$, solve the linear system (3), built from $(H)_{S_1,S_1} + D_{a_{S_1}}$ and $(y)_{S_1}$, for $(\beta)_{S_1}$ and $b$, keeping $(\beta)_{S_3} = C$.
2. Recompute the errors $e$ from the new $(\beta, b)$ and the previous values $(\beta_a, b_a)$, using $H$ and $D_y$.
3. Update the weights: $a_i = 0$ for every sample in $S_1 \cup S_2 \cup S_3$ with $e_i y_i < 0$, and a non-zero value given by (8) otherwise.
4. Sets reordering: a. Move every sample in $S_3$ with $e_i y_i < 0$ to $S_2$. b. Move every sample in $S_1$ with $\beta_i = C$ to $S_3$. c. Move every sample in $S_1$ with $a_i = 0$ to $S_2$. d. Move every sample in $S_2$ with $a_i \ne 0$ to $S_1$.
5. $e_a = e$, $\beta_a = \beta$, $G_{13} = (H)_{S_1,S_3}(\beta)_{S_3} + (G_{in})_{S_1}$, $b_a = b$ and $G_{b3} = -y^{\top}_{S_3}(\beta)_{S_3} + G_{bin}$.
6. Go to step 1 and repeat until convergence.

The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunking scheme such as the one proposed in [8, 6], so that it can be directly applied to the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever $H$ is too large to fit into memory. In those cases, several SVCs with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divided into a working set, $S_w$, which is solved as a full SVC problem, and an inactive set, $S_{in}$. If there are support vectors in the inactive set, as there might be, the inactive set modifies the IRWLS-SVC procedure, adding a contribution to the independent term in the linear equation system (3). Those support vectors in $S_{in}$ can be seen as anchored samples in $S_3$, because their $\beta_i$ is not zero and cannot be modified by the IRWLS procedure.
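As an aside, the basic re-weighting loop summarized above (fix the weights $a_i$, solve a weighted least-squares problem, recompute the errors and the weights, repeat) can be sketched for a linear kernel as follows. This is only an illustrative sketch, not the paper's kernelised, chunked implementation: the weight update below majorises the hinge loss and merely approximates the KKT-based rule (8), its constants and caps are assumptions, and the function name irwls_svc_linear is hypothetical.

```python
import numpy as np

def irwls_svc_linear(X, y, C=1.0, iters=30, eps=1e-8):
    """Minimal linear-kernel illustration of iterative re-weighted least
    squares for an SVC-style objective: alternate between solving a weighted
    regularised least-squares problem for (w, b) and updating the weights."""
    n, d = X.shape
    a = np.full(n, C)                        # start with every sample active
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        A = np.diag(a)
        # Normal equations of 0.5*||w||^2 + 0.5*sum_i a_i (y_i - w.x_i - b)^2.
        top = np.hstack([np.eye(d) + X.T @ A @ X, (X.T @ A @ np.ones(n))[:, None]])
        bot = np.hstack([np.ones(n) @ A @ X, [np.ones(n) @ A @ np.ones(n)]])
        lhs = np.vstack([top, bot[None, :]])
        rhs = np.concatenate([X.T @ A @ y, [np.ones(n) @ A @ y]])
        sol = np.linalg.solve(lhs, rhs)
        w, b = sol[:d], sol[d]
        e = y - (X @ w + b)                  # signed error on every sample
        viol = y * e > 0                     # margin violators keep a non-zero weight
        # Assumed update: a_i ~ 2C/(y_i e_i) on violators (capped), ~0 elsewhere.
        a = np.where(viol, np.minimum(2.0 * C / np.maximum(y * e, eps), 1e6), eps)
    return w, b
```

With this weight choice the weighted objective matches the hinge-loss penalty on the margin violators at the current iterate, which is what makes the re-weighting loop sensible.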
Returning to the chunking scheme: this contribution ($G_{in}$ and $G_{bin}$) is calculated in the same way as $G_{13}$ and $G_{b3}$ (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider $G_{in}$ and $G_{bin}$, which must be set to zero if the Hessian matrix $H$ fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing the IRWLS procedure as the minimization engine, is summarized in the following steps:

1. Select the samples that will form the working set.
2. Construct $G_{in} = (H)_{S_w, S_{in}}(\beta)_{S_{in}}$ and $G_{bin} = -y^{\top}_{S_{in}}(\beta)_{S_{in}}$.
3. Solve the IRWLS-SVC procedure, following the steps in Table 1.
4. Compute the error of every training sample.
5. If the stopping conditions
$$y_i e_i < \varepsilon \quad \forall i: \beta_i = 0 \qquad (9)$$
$$e_i y_i > -\varepsilon \quad \forall i: \beta_i = C \qquad (10)$$
$$|e_i y_i| < \varepsilon \quad \forall i: 0 < \beta_i < C \qquad (11)$$
are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and $\varepsilon$ must be a small value, around $10^{-3}$; a full discussion concerning this topic can be found in [6].

3 SAMPLE SELECTION STRATEGY

The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because this decision directly determines the number of IRWLS-SVC (or QP-SVC) procedures to be called and the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking scheme. In order to solve the SVC efficiently, we first need to define a candidate set of training samples from which the working set is formed in each iteration. The candidate set is made up, as it could not be otherwise, of all the training samples that violate the stopping conditions (9)-(11); we also add all those training samples that satisfy condition (11) but for which a small variation in their error would make them violate that condition. The strategies for selecting the working set are as numerous as the problems to be solved, but one can think of three simple strategies:

• Select those samples which do not fulfill the stopping criteria and present the largest $|e_i|$ values.
• Select those samples which do not fulfill the stopping criteria and present the smallest $|e_i|$ values.
• Select them randomly from the ones that do not fulfill the stopping conditions.

The first strategy seems the most natural one and it was proposed in [6]. If the largest $|e_i|$ samples are selected, we guarantee that the attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration, and the $\beta_i$ values, to oscillate around their optimal values. The magnitude of this effect is directly proportional to the value of $C$ and of $q$ (the size of the working set), so in the case of small $C$ ($C < 10$) and low $q$ ($q < 20$) it is less noticeable. The second strategy is the most conservative one, because we move towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples with $0 < \beta_i < C$ increases above $q$ there might be some iterations where we make no progress (the working set is only made up of training samples that fulfill the stopping condition in (11)).
This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than $q$, the working set is completed with those samples that fulfil the stopping conditions and present the smallest $|e_i|$. In summary, the proposed sample selection strategy is²:

1. Construct the candidate set $S_c$ with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose $\beta$ obeys $0 < \beta_i < C$.
2. If $|S_c| < q$, go to 5.
3. Choose a sample per class that violates each one of the stopping conditions and move them from $S_c$ to the working set $S_w$.
4. Choose randomly $q - |S_w|$ samples from $S_c$ and move them to $S_w$. Go to Step 6.
5. Move every sample from $S_c$ to $S_w$, and use the $q - |S_w|$ samples that fulfill the stopping conditions (9) and (10) and present the lowest $|e_i|$ values to complete $S_w$.
6. Go on, obtaining $G_{in}$ and $G_{bin}$.

(A simplified sketch of this selection step is given at the end of this entry, after the references.)

²In what follows, $|\cdot|$ represents absolute value for numbers and cardinality for sets.

4 BENCHMARK FOR THE IRWLS-SVC

We have prepared two different experiments to test both the IRWLS procedure and the sample selection strategy for solving the SVC. The first one compares the IRWLS against QP and the second one compares the sample selection strategy, together with the IRWLS, against a complete solving procedure for the SVC, SVMlight. In the first trial, we have replaced the LOQO interior point optimizer used by SVMlight version 3.02 [5] with the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with the same sample selection strategy. The comparison has been made on a Pentium III 450 MHz with 128 Mb running Windows 98, and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first one, containing 4781 training samples, needs most CPU resources to compute the RKHS, and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each $S_w$; $q$ indicates the size of the working set.

Table 2: CPU Time is the time in seconds consumed by the whole procedure; Optimize Time is the time in seconds consumed by the LOQO or IRWLS procedure alone.

            Adult4 (4781 samples)             Splice (2175 samples)
        CPU time        Optimize Time     CPU time        Optimize Time
  q     LOQO   IRWLS    LOQO   IRWLS      LOQO   IRWLS    LOQO    IRWLS
  20    21.25  20.70    0.61   0.39       46.19  30.76    21.94   4.77
  40    20.60  19.22    1.01   0.17       71.34  24.93    46.26   8.07
  70    21.15  18.72    2.30   0.46       53.77  20.32    34.24   7.72

The value of $C$ has been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, with its parameter $\sigma$ set, respectively, to 10 and 70. As can be seen, SVMlight with IRWLS is significantly faster than with the LOQO procedure in all cases. The kernel cache size has been set to 64 Mb for both data sets and for both procedures. The results in Table 2 validate the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3; we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from SVMlight [6] for our computer program. To test these two programs, several data sets have been used. The Adult and Web data sets have been obtained from J.
Platt's web page http://research.microsoft.com/~jplatt/smo.html; the Gauss-M data set is a two-dimensional classification problem proposed in [3] to test neural networks, which comprises a Gaussian random variable for each class, where the two classes highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Rätsch's web page http://svm.first.gmd.de/~raetsch/. The selection of $C$ and the RKHS has been done as indicated in [11] for the Adult and Web data sets and as indicated at http://svm.first.gmd.de/~raetsch/ for the Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of $q$ has been selected as the one that minimises the runtime.

Table 3: Runtime complexity (CPU time in seconds) for several data sets, when solved with svcradit (radit for short) and SVMlight (light for short).

  Database   Dim   N Sampl.   C       sigma   SV      q radit   q light   CPU radit   CPU light
  Adult6     123   11221      1       10      4477    150       40        118.2       124.46
  Adult9     123   32562      1       10      12181   130       70        1093.29     1097.09
  Adult1     123   1605       1000    10      630     100       10        25.98       113.54
  Web1       300   2477       5       10      224     100       10        2.42        2.36
  Web7       300   24693      5       10      1444    150       10        158.13      124.57
  Gauss-M    2     4000       1       1       1736    70        10        12.69       48.28
  Gauss-M    2     4000       100     1       1516    100       10        61.68       3053.20
  Banana     2     400        316.2   1       80      40        70        0.33        0.77
  Banana     2     4900       316.2   1       1084    70        40        22.46       1786.56
  Diabetes   8     768        10      2       409     40        10        2.41        6.04
  Splice     69    2175       1000    70      525     150       20        14.06       49.19

One can appreciate that svcradit is faster than SVMlight for most data sets. For the Web data set, which is the only data set for which SVMlight is slightly faster, the value of $C$ is low and most training samples end up as support vectors with $\beta_i < C$. In such cases the best strategy is to take the largest step towards the solution in every iteration, as SVMlight does [6], because most training samples' $\beta_i$ will not be affected by the other training samples' $\beta_j$ values. But as the value of $C$ increases, the svcradit sample selection strategy becomes a much more appropriate strategy than the one used in SVMlight.

5 CONCLUSIONS

In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster than the usual QP procedure and simpler to implement, because the most difficult step is the solution of a linear equation system, which can easily be obtained by means of LU decomposition [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working set is large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which makes it possible to work with large training data sets. All these modifications have been brought together in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/.

6 ACKNOWLEDGEMENTS

We are sincerely grateful to Thorsten Joachims, who has allowed and encouraged us to use his SVMlight to test our IRWLS procedure; the comparisons could not have been properly done otherwise.

References

[1] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburgh, U.S.A., 1992.
[2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1994.
[4] P.
W. Holland and R. E. Welsch. Robust regression using iteratively re-weighted least squares. Communications in Statistics - Theory and Methods, A6(9):813-827, 1977.
[5] T. Joachims. http://www-ai.informatik.uni-dortmund.de/forschung/verfahren/svmlight/svmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998.
[6] T. Joachims. Making Large Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, editors Schölkopf, B., Burges, C. J. C. and Smola, A. J., pages 169-184. M.I.T. Press, 1999.
[7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A., 1997.
[8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998.
[9] F. Pérez-Cruz, A. Navia-Vázquez
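As referenced earlier in this entry, here is a simplified sketch of the random working-set selection strategy of Section 3: gather the candidate samples (violators of conditions (9)-(10) plus the free support vectors), draw the working set at random from them, and top up with the smallest-|e_i| fulfilled samples when the candidate set is too small. The per-class violator step is omitted, and the function name and the tolerance parameter are assumptions made for illustration.

```python
import numpy as np

def select_working_set(e, beta, y, C=1.0, q=40, tol=1e-3, rng=None):
    """Simplified random working-set selection: candidates are the samples
    that violate the stopping conditions (beta = 0 with y_i e_i >= tol, or
    beta = C with y_i e_i <= -tol) plus the free support vectors; the working
    set is drawn at random from them."""
    rng = np.random.default_rng() if rng is None else rng
    ye = y * e
    viol_zero = (beta <= 0) & (ye > tol)      # should satisfy y_i e_i < tol
    viol_C    = (beta >= C) & (ye < -tol)     # should satisfy e_i y_i > -tol
    free      = (beta > 0) & (beta < C)       # 0 < beta_i < C candidates
    candidates = np.flatnonzero(viol_zero | viol_C | free)
    if len(candidates) >= q:
        chosen = rng.choice(candidates, size=q, replace=False)
    else:
        rest = np.setdiff1d(np.arange(len(e)), candidates)
        top_up = rest[np.argsort(np.abs(e[rest]))[: q - len(candidates)]]
        chosen = np.concatenate([candidates, top_up])
    return chosen
```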
6 0.62430739 111 nips-2000-Regularized Winnow Methods
7 0.62094843 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
8 0.62066877 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
9 0.61831641 133 nips-2000-The Kernel Gibbs Sampler
10 0.61716443 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
11 0.61616725 21 nips-2000-Algorithmic Stability and Generalization Performance
12 0.61559778 130 nips-2000-Text Classification using String Kernels
13 0.61385691 79 nips-2000-Learning Segmentation by Random Walks
14 0.60908759 122 nips-2000-Sparse Representation for Gaussian Process Models
15 0.60647625 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
16 0.60537773 75 nips-2000-Large Scale Bayes Point Machines
17 0.60367513 36 nips-2000-Constrained Independent Component Analysis
18 0.60268831 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
19 0.60267758 37 nips-2000-Convergence of Large Margin Separable Linear Classification
20 0.60244441 51 nips-2000-Factored Semi-Tied Covariance Matrices