nips nips2000 nips2000-130 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a novel kernel for comparing two text documents. [sent-4, score-0.666]
2 The kernel is an inner product in the feature space consisting of all subsequences of length k. [sent-5, score-1.003]
3 A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. [sent-6, score-0.548]
4 The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. [sent-7, score-0.485]
5 A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. [sent-8, score-0.42]
6 The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. [sent-9, score-0.274]
7 A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. [sent-10, score-1.158]
8 1 Introduction Standard learning systems (like neural networks or decision trees) operate on input data after they have been transformed into feature vectors x_1, ..., x_l ∈ X from an n-dimensional space. [sent-11, score-0.243]
9 There are cases, however, where the input data can not be readily described by explicit feature vectors: for example biosequences, images, graphs and text documents. [sent-12, score-0.482]
10 For such datasets, the construction of a feature extraction module can be as complex and expensive as solving the entire problem. [sent-13, score-0.223]
11 An effective alternative to explicit feature extraction is provided by kernel methods. [sent-14, score-0.638]
12 Kernel-based learning methods use an implicit mapping of the input data into a high-dimensional feature space defined by a kernel function, i.e. [sent-15, score-0.628]
13 a function returning the inner product between the images of two data points in the feature space. [sent-17, score-0.317]
14 The learning then takes place in the feature space, provided the learning algorithm can be entirely rewritten so that the data points only appear inside dot products with other data points. [sent-18, score-0.237]
15 One interesting property of kernel-based systems is that, once a valid kernel function has been selected, one can practically work in spaces of any dimensionality without paying any computational cost, since the feature mapping is never effectively performed. [sent-21, score-0.657]
16 In fact, one does not even need to know what features are being used. [sent-22, score-0.049]
17 In this paper we examine the use of a kernel method based on string alignment for text categorization problems. [sent-23, score-1.095]
18 A standard approach [5] to text categorisation makes use of the so-called bag of words (BOW) representation, mapping a document to a bag (i.e. [sent-24, score-0.697]
19 a set that counts repeated elements), hence losing all the word order information and only retaining the frequency of the terms in the document. [sent-26, score-0.237]
20 This is usually accompanied by the removal of non-informative words (stop words) and by replacing words with their stems, thereby losing inflection information. [sent-27, score-0.167]
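As an illustration of this baseline representation, the following Python sketch (a hypothetical, minimal example: the stop-word list is an illustrative placeholder and stemming is omitted) maps a document to a bag of term frequencies:

import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative subset, not the list used in [5]

def bag_of_words(document):
    # Lower-case, tokenise on alphabetic runs, drop stop words,
    # and keep only term frequencies: word order is discarded.
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

For example, bag_of_words("The cat sat on the mat") yields Counter({'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}).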
21 In this paper we propose a radically different approach that considers documents simply as symbol sequences and makes use of specific kernels. [sent-29, score-0.137]
22 The approach is entirely subsymbolic, in the sense that it considers the document just as a single long sequence, and yet it is still capable of capturing topic information. [sent-30, score-0.29]
23 We build on recent advances [11, 4] that demonstrated how to build kernels over general structures like sequences. [sent-31, score-0.132]
24 The most remarkable property of such methods is that they map documents to vectors without explicitly representing them, by means of sequence alignment techniques. [sent-32, score-0.27]
25 A dynamic programming technique makes the computation of the kernels very efficient (linear in the documents' length). [sent-33, score-0.298]
26 Support Vector Machines [3] are linear classifiers in a kernel-defined feature space. [sent-35, score-0.573]
27 The kernel is a function which returns the dot product of the feature vectors φ(x) and φ(x') of two inputs x and x': K(x, x') = φ(x)^T φ(x'). [sent-36, score-0.664]
28 Choosing very high dimensional feature spaces ensures that the required functionality can be obtained using linear classifiers. [sent-37, score-0.304]
29 The computational difficulties of working in such feature spaces are avoided by using a dual representation of the linear functions in terms of the training set S = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)}, namely f(x) = Σ_{i=1}^{l} α_i y_i K(x_i, x) + b. [sent-38, score-0.265]
30 The danger of overfitting by resorting to such a high dimensional space is averted by maximising the margin, or a related soft version of this criterion, a strategy that has been shown to ensure good generalisation despite the high dimensionality [9, 8]. [sent-42, score-0.171]
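For concreteness, a minimal Python sketch of this dual decision function (the coefficient, label and support-example arrays are assumed to come from some SVM training routine; this is an illustration, not the implementation used in the paper):

def svm_decision(x, support, alphas, labels, b, kernel):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the training/support examples.
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alphas, labels, support)) + b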
31 2 A Kernel for Text Sequences In this section we describe a kernel between two text documents. [sent-43, score-0.666]
32 The idea is to compare them by means of the substrings they contain: the more substrings in common, the more similar they are. [sent-44, score-0.34]
33 An important part is that such substrings do not need to be contiguous, and the degree of contiguity of one such substring in a document determines how much weight it will have in the comparison. [sent-45, score-0.492]
34 For example: the substring 'c-a-r' is present both in the word 'card' and in the word 'custard', but with different weighting. [sent-46, score-0.422]
35 For each such substring there is a dimension of the feature space, and the value of that coordinate depends on how frequently and how compactly the string is embedded in the text. [sent-47, score-0.597]
36 This weighting is controlled by a decay factor λ ∈ (0,1) that can be used to weight the presence of a certain feature in a text (see Definition 1 for more details). [sent-49, score-0.448]
37 If we consider the words cat, car, bat, bar and only k = 2, we obtain an 8-dimensional feature space indexed by the pairs c-a, c-t, a-t, b-a, b-t, c-r, a-r, b-r, where the words are mapped as follows: φ(cat) = (λ², λ³, λ², 0, 0, 0, 0, 0), φ(car) = (λ², 0, 0, 0, 0, λ³, λ², 0), φ(bat) = (0, 0, λ², λ², λ³, 0, 0, 0), φ(bar) = (0, 0, 0, λ², 0, 0, λ², λ³). [sent-52, score-0.221]
38 Hence, the unnormalized kernel between car and cat is K(car, cat) = λ⁴, while the normalized version is K(car, cat)/√(K(car, car) K(cat, cat)) = λ⁴/(2λ⁴ + λ⁶) = 1/(2 + λ²). [sent-64, score-0.529]
39 Note that in general the document will contain more than one word, but the mapping for the whole document is into one feature space. [sent-72, score-0.565]
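The explicit feature map in the example above can be reproduced by a brute-force Python sketch that enumerates every k-tuple of indices (the function names are ours, and this enumeration is only feasible for very short strings and small k):

from itertools import combinations
from collections import defaultdict

def feature_map(word, k, lam):
    # phi_u(word): for every k-tuple of positions spelling u, add lam**span,
    # where span = i_k - i_1 + 1 is the spread of the occurrence in the word.
    phi = defaultdict(float)
    for idx in combinations(range(len(word)), k):
        u = ''.join(word[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def k_brute(s, t, k, lam):
    # Unnormalized kernel: inner product of the two explicit feature maps.
    ps, pt = feature_map(s, k, lam), feature_map(t, k, lam)
    return sum(ps[u] * pt[u] for u in ps if u in pt)

For a numerical lam, feature_map('cat', 2, lam) maps 'ca' to lam**2, 'ct' to lam**3 and 'at' to lam**2, and k_brute('car', 'cat', 2, lam) returns lam**4, matching the worked example above.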
40 However, for interesting substring sizes (e.g. k > 4) direct computation of all the relevant features would be impractical even for moderately sized texts, and hence explicit use of such a representation would be impossible. [sent-74, score-0.522]
41 But it turns out that a kernel using such features can be defined and calculated in a very efficient way by using dynamic programming techniques. [sent-75, score-0.483]
42 We derive the kernel by starting from the features and working out their inner product. [sent-76, score-0.536]
43 In this case there is no need to prove that it satisfies Mercer's conditions (symmetry and positive semi-definiteness) since they will follow automatically from its definition as an inner product. [sent-77, score-0.148]
44 This kernel is based on work [11, 4] mostly motivated by bioinformatics applications. [sent-78, score-0.381]
45 It maps strings to a feature vector indexed by all k-tuples of characters. [sent-79, score-0.46]
46 A k-tuple will have a non-zero entry if it occurs as a subsequence anywhere (not necessarily contiguously) in the string. [sent-80, score-0.154]
47 The weighting of the feature will be the sum over the occurrences of the k-tuple of a decaying factor of the length of the occurrence. [sent-81, score-0.434]
48 Definition 1 (String subsequence kernel) Let Σ be a finite alphabet. [sent-82, score-0.154]
49 A string is a finite sequence of characters from Σ, including the empty sequence. [sent-83, score-0.371]
50 For strings s, t, we denote by |s| the length of the string s = s_1 ... s_|s|, [sent-84, score-0.599]
51 and by st the string obtained by concatenating the strings s and t. [sent-86, score-0.583]
52 The string s[i : j] is the substring s_i ... s_j of s. [sent-87, score-0.461]
53 We say that u is a subsequence of s if there exist indices i = (i_1, ..., i_|u|), with 1 ≤ i_1 < ... < i_|u| ≤ |s|, such that u_j = s_{i_j} for j = 1, ..., |u|, or u = s[i] for short. [sent-88, score-0.154]
54 The length l(i) of the subsequence in s is i_|u| - i_1 + 1. [sent-98, score-0.26]
55 We denote by Σ^n the set of all finite strings of length n, and by Σ* the set of all strings. We now define the feature space F_n = R^(Σ^n); the feature mapping φ for a string s is given by defining the u coordinate φ_u(s) for each u ∈ Σ^n as φ_u(s) = Σ_{i: u = s[i]} λ^{l(i)}, u ∈ Σ^n. (1) [sent-99, score-1.471]
56 These features measure the number of occurrences of subsequences in the string s, weighting them according to their lengths. [sent-104, score-0.356]
57 Hence, the inner product of the feature vectors for two strings s and t gives a sum over all common subsequences, weighted according to their frequency of occurrence and lengths: K_n(s, t) = Σ_{u ∈ Σ^n} φ_u(s) φ_u(t) = Σ_{u ∈ Σ^n} Σ_{i: u = s[i]} Σ_{j: u = t[j]} λ^{l(i) + l(j)}. [sent-105, score-0.845]
58 To remove any bias introduced by document length, the kernel is normalised as K̂(s, t) = K(s, t)/√(K(s, s) K(t, t)). The normalised kernel introduced above was implemented using the recursive formulas described in the next section. [sent-108, score-0.381]
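A minimal Python sketch of this normalisation step (kernel stands for any routine returning the unnormalised K_n, for instance the dynamic-programming one sketched later):

import math

def k_normalised(s, t, n, lam, kernel):
    # K_hat(s, t) = K(s, t) / sqrt(K(s, s) * K(t, t))
    return kernel(s, t, n, lam) / math.sqrt(kernel(s, s, n, lam) * kernel(t, t, n, lam))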
59 The next section gives some more details of the algorithmics and this is followed by a section describing the results of applying the kernel in a Support Vector Machine for text classification. [sent-109, score-0.723]
60 3 Algorithmics In this section we describe how special design techniques provide a significant speedup of the procedure, by both accelerating the kernel evaluations and reducing their number. [sent-110, score-0.381]
61 In order to deal with large datasets, we used a form of chunking: beginning with a very small subset of the data and gradually building up the size of the training set, while ensuring that only points which failed to meet margin 1 on the current hypothesis were included in the next chunk. [sent-112, score-0.131]
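A rough sketch of this chunking loop (train_svm and margin are assumed placeholder helpers standing in for the actual SVM trainer and its margin evaluation, and the chunk size is arbitrary):

def chunked_training(X, y, train_svm, margin, chunk=50):
    # Start from a small working set and repeatedly add only the points
    # that fail to meet margin 1 under the current hypothesis.
    working = list(range(min(chunk, len(X))))
    model = train_svm([X[i] for i in working], [y[i] for i in working])
    while True:
        violators = [i for i in range(len(X))
                     if i not in working and y[i] * margin(model, X[i]) < 1]
        if not violators:
            return model
        working += violators[:chunk]
        model = train_svm([X[i] for i in working], [y[i] for i in working])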
62 Since each evaluation of the kernel function requires non-negligible computational resources, we designed the system to calculate only those entries of the kernel matrix that are actually required by the training algorithm. [sent-113, score-0.822]
63 This can significantly reduce the training time, since only a relatively small part of the kernel matrix is actually used by our implementation of SVM. [sent-114, score-0.412]
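One way to realise this lazy evaluation is to wrap the kernel in a cache that fills entries of the symmetric kernel matrix only when the training algorithm asks for them; a minimal sketch (class and method names are ours):

class LazyKernelMatrix:
    def __init__(self, docs, kernel):
        self.docs, self.kernel = docs, kernel
        self.cache = {}

    def __call__(self, i, j):
        # Exploit symmetry K(i, j) = K(j, i) and compute each entry at most once.
        key = (i, j) if i <= j else (j, i)
        if key not in self.cache:
            self.cache[key] = self.kernel(self.docs[key[0]], self.docs[key[1]])
        return self.cache[key]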
64 Special care in the implementation of the kernel described in Definition 1 can significantly speed up its evaluation. [sent-115, score-0.381]
65 As can be seen from the description of recursion (2) in Definition 2, its computation takes time proportional to n|s||t|², as the outermost recursion is over the sequence length and, for each length and each additional character in s and t, a sum over the sequence t must be evaluated. [sent-116, score-0.568]
66 The complexity of the computation can be reduced to O(n|s||t|) by first evaluating K''_i(sx, t) = Σ_{j: t_j = x} K'_{i-1}(s, t[1 : j - 1]). [sent-117, score-0.065]
67 These observations together give an O(|s||t|) recursion for computing K''_i(s, t). [sent-124, score-0.102]
68 Hence, we can evaluate the overall kernel in O(n|s||t|) time. [sent-125, score-0.381]
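Assuming the standard recursion for this kernel (K'_0 ≡ 1; K'_i(s, t) = 0 if min(|s|, |t|) < i; K'_i(sx, t) = λ K'_i(s, t) + Σ_{j: t_j = x} K'_{i-1}(s, t[1 : j - 1]) λ^{|t| - j + 2}; and K_n(sx, t) = K_n(s, t) + λ² Σ_{j: t_j = x} K'_{n-1}(s, t[1 : j - 1])), a direct Python transcription looks like the sketch below; for readability it implements the plain O(n|s||t|²) form rather than the optimised O(n|s||t|) variant, and the variable names are ours:

import numpy as np

def ssk(s, t, n, lam=0.5):
    # Kp[i, a, b] holds K'_i(s[:a], t[:b]); K'_0 is 1 everywhere and entries
    # with a prefix shorter than i remain 0, as in the recursion above.
    Kp = np.zeros((n, len(s) + 1, len(t) + 1))
    Kp[0, :, :] = 1.0
    for i in range(1, n):
        for a in range(i, len(s) + 1):
            x = s[a - 1]
            for b in range(i, len(t) + 1):
                Kp[i, a, b] = lam * Kp[i, a - 1, b] + sum(
                    Kp[i - 1, a - 1, j] * lam ** (b - j + 1)
                    for j in range(b) if t[j] == x)
    # Accumulate K_n(s, t): each matching final character closes a length-n
    # subsequence and contributes lam**2 times the corresponding K' value.
    return sum(Kp[n - 1, a - 1, j] * lam ** 2
               for a in range(1, len(s) + 1)
               for j in range(len(t)) if t[j] == s[a - 1])

For instance, ssk('car', 'cat', 2, 0.5) returns 0.0625 = 0.5**4, in agreement with the worked example of Section 2.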
69 4 Experimental Results Our aim was to test the efficacy of this new approach to feature extraction for text categorization, and to compare with a state-of-the-art system such as the one used in [6]. [sent-126, score-0.508]
70 As expected, using longer substrings in the comparison of two documents gives an improved performance. [sent-128, score-0.273]
71 We used the same dataset as that reported in [6], namely the Reuters-21578 [7], as well as the Medline document collection of 1033 document abstracts from the National Library of Medicine. [sent-129, score-0.217]
72 We performed all of our experiments on a subset of four categories, 'earn', 'acq', 'crude', and 'corn'. [sent-130, score-0.031]
73 We applied the two different kernels to a subset of Reuters of 380 training examples and 90 test examples. [sent-133, score-0.136]
74 The only difference in the experiments was the kernel used. [sent-134, score-0.381]
75 The splits of the data had the following numbers of positive examples in the training (testing) sets, out of 370 (90) examples, in the Reuters database: earn 152 (40); acq 114 (25); crude 76 (15); corn 38 (10). [sent-135, score-0.309]
76 The preliminary experiment used different values of k, in order to identify the optimal one, with the category 'earn'. [sent-136, score-0.111]
77 The following experiments all used a sequence length of 5 for the string subsequences kernel. [sent-137, score-0.625]
78 The results obtained are given in the following tables, where the precision, recall and F1 values are shown for both kernels. [sent-140, score-0.095]
79 Table 1: F1, Precision, Recall and number of Support Vectors for the top Reuters category 'earn', averaged over 10 splits (n S-K = string kernel of length n, W-K = word kernel). [sent-153, score-1.473]
80 Table 2: Precision, Recall and F1 numbers for the 4 categories for the two kernels: word kernel (W-K) and subsequences kernel (5 S-K). The results are better in one category, and similar or slightly better for the other categories. [sent-179, score-1.206]
81 They certainly indicate that the new kernel can outperform the more classical approach, but equally the performance is not reliably better. [sent-180, score-0.381]
82 The last table shows the results obtained for two categories in the Medline data, numbers 20 and 23. [sent-181, score-0.146]
83 Table 3: F1 and number of Support Vectors for the top two Medline queries. 5 Conclusions The paper has presented a novel kernel for text analysis, and tested it on a categorization task, which relies on evaluating an inner product in a very high-dimensional feature space. [sent-190, score-1.154]
84 For a given sequence length k (k = 5 was used in the experiments reported) the features are indexed by all strings of length k. [sent-191, score-0.586]
85 Direct computation of all the relevant features would be impractical even for moderately sized texts. [sent-192, score-0.272]
86 The paper has presented a dynamic programming style computation for computing the kernel directly from the input sequences without explicitly calculating the feature vectors. [sent-193, score-0.701]
87 Further refinements of the algorithm have resulted in a practical alternative to the more standard word feature based kernel used in previous SVM applications to text classification [6]. [sent-194, score-0.997]
88 We have presented an experimental comparison of the word feature kernel with our subsequences kernel on a benchmark dataset with encouraging results. [sent-195, score-1.309]
89 The results reported here are very preliminary and many questions remain to be resolved. [sent-196, score-0.085]
90 First, more extensive experiments are required to gain a more reliable picture of the performance of the new kernel, including the effect of varying the subsequence length and the parameter λ. [sent-197, score-0.26]
91 The evaluation of the new kernel is still relatively time-consuming and more research is needed to investigate ways of expediting this phase of the computation. [sent-199, score-0.41]
92 Text categorization with support vector machines: Learning with many relevant features. [sent-231, score-0.218]
wordName wordTfidf (topN-words)
[('kernel', 0.381), ('text', 0.285), ('string', 0.262), ('strings', 0.231), ('subsequences', 0.199), ('sx', 0.199), ('document', 0.18), ('substrings', 0.17), ('feature', 0.163), ('subsequence', 0.154), ('substring', 0.142), ('word', 0.14), ('sv', 0.124), ('length', 0.106), ('rp', 0.106), ('inner', 0.106), ('documents', 0.103), ('earn', 0.099), ('medline', 0.099), ('categorization', 0.096), ('recall', 0.095), ('ki', 0.085), ('occurrences', 0.077), ('car', 0.077), ('recursion', 0.074), ('kernels', 0.074), ('precision', 0.073), ('cat', 0.071), ('alignment', 0.071), ('spaces', 0.071), ('margin', 0.069), ('bag', 0.066), ('bat', 0.066), ('reuters', 0.066), ('rpu', 0.066), ('cristianini', 0.063), ('category', 0.063), ('extraction', 0.06), ('sand', 0.06), ('support', 0.06), ('sequence', 0.058), ('words', 0.058), ('sri', 0.057), ('sized', 0.057), ('moderately', 0.057), ('algorithmics', 0.057), ('decaying', 0.057), ('categories', 0.055), ('dynamic', 0.053), ('holloway', 0.051), ('characters', 0.051), ('losing', 0.051), ('numbers', 0.05), ('features', 0.049), ('preliminary', 0.048), ('product', 0.048), ('splits', 0.048), ('hence', 0.046), ('fn', 0.045), ('impractical', 0.045), ('encouraging', 0.045), ('definition', 0.042), ('dimensional', 0.042), ('mapping', 0.042), ('table', 0.041), ('entirely', 0.04), ('vectors', 0.038), ('svm', 0.038), ('acm', 0.037), ('reported', 0.037), ('programming', 0.036), ('sequences', 0.036), ('indexed', 0.036), ('topic', 0.036), ('explicit', 0.034), ('dot', 0.034), ('considers', 0.034), ('datasets', 0.033), ('evaluating', 0.033), ('computation', 0.032), ('relevant', 0.032), ('despite', 0.031), ('weighting', 0.031), ('subset', 0.031), ('training', 0.031), ('coordinate', 0.03), ('sl', 0.03), ('vector', 0.03), ('soft', 0.029), ('build', 0.029), ('royal', 0.029), ('technical', 0.029), ('classifiers', 0.029), ('evaluation', 0.029), ('punctuation', 0.028), ('refinements', 0.028), ('texts', 0.028), ('functionality', 0.028), ('lsl', 0.028), ('dortmund', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
2 0.23350225 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
3 0.22621349 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
4 0.17402026 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
Author: David A. Cohn, Thomas Hofmann
Abstract: We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics is mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.
5 0.1530339 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
Author: Christopher K. I. Williams
Abstract: In this paper we show that the kernel PCA algorithm of Schölkopf et al (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function k(x, y) is isotropic, i.e. it depends only on ||x - y||. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed. 1
6 0.15241855 6 nips-2000-A Neural Probabilistic Language Model
7 0.15096979 110 nips-2000-Regularization with Dot-Product Kernels
8 0.14363964 133 nips-2000-The Kernel Gibbs Sampler
9 0.13979363 4 nips-2000-A Linear Programming Approach to Novelty Detection
10 0.1395741 58 nips-2000-From Margin to Sparsity
11 0.13733146 54 nips-2000-Feature Selection for SVMs
12 0.13562509 75 nips-2000-Large Scale Bayes Point Machines
13 0.12207279 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
14 0.11895523 74 nips-2000-Kernel Expansions with Unlabeled Examples
15 0.10571405 12 nips-2000-A Support Vector Method for Clustering
16 0.098968603 138 nips-2000-The Use of Classifiers in Sequential Inference
17 0.096298814 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
18 0.090580128 120 nips-2000-Sparse Greedy Gaussian Process Regression
19 0.089413561 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
20 0.081062689 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
topicId topicWeight
[(0, 0.275), (1, 0.208), (2, -0.094), (3, 0.121), (4, -0.143), (5, 0.292), (6, -0.063), (7, 0.059), (8, -0.016), (9, 0.155), (10, 0.153), (11, -0.043), (12, -0.057), (13, 0.046), (14, -0.027), (15, 0.108), (16, 0.189), (17, -0.022), (18, 0.05), (19, 0.033), (20, 0.094), (21, -0.02), (22, 0.04), (23, -0.102), (24, 0.119), (25, -0.268), (26, 0.015), (27, -0.048), (28, 0.014), (29, 0.085), (30, -0.019), (31, 0.031), (32, -0.01), (33, 0.002), (34, 0.087), (35, -0.032), (36, -0.072), (37, -0.042), (38, -0.136), (39, 0.007), (40, 0.061), (41, -0.094), (42, 0.016), (43, -0.073), (44, 0.029), (45, 0.015), (46, -0.017), (47, 0.046), (48, 0.052), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.95852345 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
2 0.66433704 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
3 0.64092356 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
Author: David A. Cohn, Thomas Hofmann
Abstract: We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics is mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.
4 0.58369744 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
5 0.57441401 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
Author: Sebastian Mika, Gunnar Rätsch, Klaus-Robert Müller
Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both, KFD and the proposed sparse KFD, can be understood in an unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.
6 0.55787253 110 nips-2000-Regularization with Dot-Product Kernels
7 0.55027068 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
8 0.47324818 6 nips-2000-A Neural Probabilistic Language Model
9 0.46637937 54 nips-2000-Feature Selection for SVMs
10 0.42453715 4 nips-2000-A Linear Programming Approach to Novelty Detection
11 0.41104752 133 nips-2000-The Kernel Gibbs Sampler
12 0.3941384 74 nips-2000-Kernel Expansions with Unlabeled Examples
13 0.38945213 138 nips-2000-The Use of Classifiers in Sequential Inference
14 0.3616994 12 nips-2000-A Support Vector Method for Clustering
15 0.35729289 75 nips-2000-Large Scale Bayes Point Machines
16 0.34094808 58 nips-2000-From Margin to Sparsity
17 0.33070374 52 nips-2000-Fast Training of Support Vector Classifiers
18 0.32232872 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
19 0.29277572 120 nips-2000-Sparse Greedy Gaussian Process Regression
20 0.27544445 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
topicId topicWeight
[(10, 0.045), (17, 0.165), (20, 0.289), (32, 0.02), (33, 0.069), (36, 0.011), (54, 0.015), (55, 0.029), (62, 0.035), (65, 0.015), (67, 0.063), (75, 0.015), (76, 0.049), (79, 0.015), (81, 0.037), (90, 0.047), (97, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.80873209 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
2 0.56490403 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
3 0.56158775 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
Author: Yee Whye Teh, Geoffrey E. Hinton
Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
4 0.56092238 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
5 0.56072426 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1
6 0.55902123 37 nips-2000-Convergence of Large Margin Separable Linear Classification
7 0.55731553 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
8 0.55441648 79 nips-2000-Learning Segmentation by Random Walks
9 0.55414468 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
10 0.55366319 133 nips-2000-The Kernel Gibbs Sampler
11 0.55122644 111 nips-2000-Regularized Winnow Methods
12 0.54953986 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
13 0.54703283 60 nips-2000-Gaussianization
14 0.54381609 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
15 0.54290211 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
16 0.54272288 21 nips-2000-Algorithmic Stability and Generalization Performance
17 0.54228425 36 nips-2000-Constrained Independent Component Analysis
18 0.53967482 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
19 0.53865111 51 nips-2000-Factored Semi-Tied Covariance Matrices
20 0.53728843 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning