jmlr jmlr2005 jmlr2005-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. [sent-9, score-1.113]
2 The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. [sent-10, score-0.228]
3 Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. [sent-11, score-0.478]
4 On large alphabets, the new sparse dynamic programming algorithm is the most efficient. [sent-12, score-0.198]
5 On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. [sent-13, score-0.461]
6 Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming [sent-14, score-0.443]
7 1. Introduction. Machine learning algorithms working on sequence data are needed in bioinformatics as well as in text categorization and mining. [sent-15, score-0.095]
8 In contrast, standard machine learning algorithms work on a feature vector representation, thus requiring a feature extraction phase to map sequence data into feature vectors. [sent-16, score-0.135]
9 Kernel methods (Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) provide an efficient way of tackling the problem of dimensionality via the use of a kernel function, corresponding to the inner product of two feature vectors. [sent-18, score-0.092]
10 The family of kernel functions defined on feature vectors computed from strings is called string kernels (Watkins, 2000; Haussler, 1999). [sent-22, score-0.266]
11 These kernels are based on features corresponding to occurrences of certain kinds of subsequences in the string. [sent-23, score-0.328]
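Viewed computationally, every kernel discussed below reduces to an inner product of sparse feature vectors indexed by subsequences. A minimal sketch of that inner product over Python dictionaries (an illustrative helper, not code from the paper; the name kernel and the dict-based representation are assumptions):

    def kernel(phi_s, phi_t):
        """Inner product of two sparse feature maps (dict: subsequence -> feature value)."""
        if len(phi_s) > len(phi_t):            # iterate over the smaller map
            phi_s, phi_t = phi_t, phi_s
        return sum(v * phi_t.get(u, 0.0) for u, v in phi_s.items())

The feature-extraction sketches further below return (or can be wrapped to return) such dictionaries, so they can be plugged into this helper directly.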
12 There is a wide variety of string kernels depending on how the subsequences are defined: they can be contiguous or non-contiguous, they [sent-24, score-0.387]
13 can have bounded or unbounded length, and the mismatches or gaps can be penalized in different ways. [sent-25, score-0.394]
14 There are three main approaches to computing string kernels efficiently. [sent-26, score-0.174]
15 The dynamic programming approaches (Lodhi et al., 2002; Cancedda et al., 2003) are based on composing the solution from simpler subproblems, in this case from kernel values of shorter subsequences and prefixes of the two strings. [sent-29, score-0.23]
16 These approaches usually have time complexity of order Ω(p|s||t|) since one typically needs to compute intermediate results for each character pair (si, tj) and for each subsequence length 1 ≤ l ≤ p. [sent-30, score-0.229]
17 In the trie-based approaches (Leslie et al., 2003; Leslie and Kuang, 2003) one makes a depth-first traversal of an implicit trie data structure. [sent-33, score-0.181]
18 The search continues along each trie path as long as both strings contain an occurrence of the p-gram corresponding to the trie node. [sent-34, score-0.708]
19 This termination condition prunes the search space very efficiently if the number of gaps is restricted enough. [sent-35, score-0.417]
20 The third approach is to build a suffix tree of one of the strings and then compute matching statistics of the other string by traversing the suffix tree (Vishwanathan and Smola, 2003). [sent-36, score-0.492]
21 However, the approach does not deal with gapped strings. [sent-38, score-0.163]
22 In this paper, we concentrate on improving the time-efficiency of the dynamic programming approach to gapped string kernel computation. [sent-39, score-0.537]
23 In Section 2 we review types of kernels that are used in text categorization and sequence analysis tasks. [sent-40, score-0.14]
24 In Section 3, we review the trie-based and dynamic programming approaches for gap-weighted string kernel computation before presenting the main contribution of this article, a sparse dynamic programming algorithm for efficiently computing the kernel on large alphabets. [sent-42, score-0.619]
25 In Section 4 the new algorithm is experimentally compared against the full dynamic programming approach and a trie-based algorithm. [sent-44, score-0.198]
26 For sequences the most common feature representation is to count or check the existence of subsequence occurrences, where the subsequences are taken from a fixed index set U. [sent-57, score-0.487]
27 Different choices of the index set and of how occurrences are counted give rise to a family of feature representations and kernels. [sent-58, score-0.155]
28 Below we review the main forms of representation for sequences and the computation of kernels for such representations. [sent-59, score-0.105]
29 In the most widely used feature representation for strings in a natural language, informally called bag-of-words (BoW), the index set is taken as the set of words in the language, possibly excluding some frequently occurring stop words (Salton et al.). [sent-61, score-0.382]
30 In the case of a string s containing English text, for each English word u, we define the feature value φu(s) = |{ j | sj · · · sj+|u|−1 = u }|, the number of occurrences of the word u in s. [sent-64, score-0.241]
31 For the example text s = ’The cat was chased by the fat dog’ the BoW will contain the following non-zero entries: φthe(s) = 2, φdog(s) = 1, φwas(s) = 1, φchased(s) = 1, φby(s) = 1, φfat(s) = 1, φcat(s) = 1. [sent-68, score-0.723]
32 These occurrence counts can also be weighted, for example by scaling each count by the inverse document frequency of the corresponding word (TFIDF, Salton & Young, 1973). [sent-69, score-0.102]
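A minimal sketch of the BoW map in Python (an illustrative helper, not the authors' code; whitespace tokenization and lower-casing are simplifying assumptions):

    from collections import Counter

    def bow_features(text, stop_words=()):
        """Count word occurrences, dropping any words listed in stop_words."""
        words = text.lower().split()          # crude whitespace tokenization
        return Counter(w for w in words if w not in stop_words)

    phi = bow_features('The cat was chased by the fat dog')
    # Counter({'the': 2, 'cat': 1, 'was': 1, 'chased': 1, 'by': 1, 'fat': 1, 'dog': 1})

Combined with the inner-product helper sketched earlier, kernel(bow_features(s), bow_features(t)) gives a basic BoW kernel; a TFIDF variant would rescale each count before taking the product.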
33 For strings that do not have a crisply defined word structure, for example biological sequences, a different approach is more suitable. [sent-75, score-0.257]
34 Given an alphabet Σ, a simple choice is to take U = Σp, the set of strings of length p. [sent-76, score-0.336]
35 For example, if we choose p = 4, the resulting feature values for our example text include φ’the ’ = 2, φ’ the’ = 1, φ’ cat’ = 1, φ’ dog’ = 1 (the spaces are part of the 4-grams), along with close to thirty additional 4-grams. [sent-78, score-0.856]
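A corresponding sketch for the p-spectrum features, counting contiguous character p-grams (again illustrative; lower-casing and treating spaces as ordinary characters are assumptions matching the example above):

    from collections import Counter

    def p_spectrum(text, p):
        """Count all contiguous substrings (p-grams) of length p."""
        s = text.lower()
        return Counter(s[i:i + p] for i in range(len(s) - p + 1))

    phi = p_spectrum('The cat was chased by the fat dog', 4)
    # e.g. phi['the '] == 2 and phi[' dog'] == 1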
36 There is a two-fold difficulty in focusing on fixed-length subsequences: firstly, one may not know how to best choose the length p. [sent-79, score-0.112]
37 Secondly, there may be important subsequences of different lengths in the sequences. [sent-80, score-0.22]
38 This problem can be circumvented by allowing the lengths of the subsequences to vary within a range: U = Σq ∪ Σq+1 ∪ · · · ∪ Σp for some 1 ≤ q ≤ p. [sent-81, score-0.22]
39 We call the resulting kernel the bounded-length substring kernel. [sent-82, score-0.181]
40 In the extreme case, we can take q = 1 and p = ∞, thus including in the index set all non-empty sequences over the alphabet Σ. [sent-84, score-0.127]
41 Secondly, if all important subsequences have length at least some q0 , setting q < q0 will probably make the spectrum more ’noisy’. [sent-86, score-0.274]
42 The most efficient approaches to computing substring spectrum kernels, working in O(p(|s| + |t|)) time, are based on suffix trees (Leslie et al., [sent-89, score-0.214]
43 2002; Vishwanathan and Smola, 2003), although dynamic programming and trie-based approaches can also be used. [sent-90, score-0.198]
44 Another way to add flexibility to our feature representation is to allow gaps in the subsequence occurrences. [sent-92, score-0.609]
45 We denote an index sequence by j = (j1, . . . , jq), with j1 < j2 < · · · < jq, and write s(j) = sj1 · · · sjq for the corresponding subsequence. [sent-97, score-0.212]
46 We define the feature φu(s) to count the number of unique index sequences j for which the corresponding subsequence sj1 · · · [sent-101, score-0.264]
47 sjq equals u, in other words φu(s) = |{j | s(j) = u}|. [sent-104, score-0.137]
48 Our example string ’The cat was chased by the fat dog’ can be seen to contain, among others, the following gapped substrings of length 3: φtea = 7, φted = 5, φdog = 2. [sent-105, score-1.183]
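The definition can be checked with a brute-force sketch that enumerates all increasing index sequences; it is exponential in the subsequence length and serves only to illustrate φu(s) = |{j | s(j) = u}|, not as a practical algorithm (lower-casing is an assumption, and whether the resulting counts match the figures quoted above depends on exactly how case and spaces are treated):

    from itertools import combinations

    def gapped_count(text, u):
        """Number of increasing index sequences j with s(j) = u."""
        s = text.lower()
        return sum(1 for j in combinations(range(len(s)), len(u))
                   if ''.join(s[k] for k in j) == u)

    phi_tea = gapped_count('The cat was chased by the fat dog', 'tea')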
49 This definition does not distinguish between occurrences that contain few gaps and those that contain several, nor does it take the lengths of the gaps into account; all occurrences contribute to the feature value equally. [sent-106, score-0.555]
50 For example, the substring ’tea’ will have a high weight as it occurs many times in the text, although it never occurs with fewer than two gaps and the total length of gaps is at least three. [sent-107, score-0.978]
51 A solution for this problem is to downweight occurrences that contain many or long gaps. [sent-109, score-0.115]
52 Such a feature representation is the basis of gap-weighted string kernels. [sent-110, score-0.197]
53 In string matching, there are many approaches to weighting gaps. [sent-111, score-0.523]
54 We consider two gap-weighting schemes, both of which downweight occurrences exponentially with increasing gap number or length. [sent-115, score-0.162]
55 When downweighting by the total length of gaps, the weight of an occurrence i = (i1, . . . [sent-116, score-0.552]
56 , iq) with span span(i) = iq − i1 + 1 is defined as λspan(i), where 0 < λ ≤ 1 is a fixed penalty constant. [sent-119, score-0.198]
57 The feature values are then defined as a normalized sum of occurrence weights, φu(s) = 1/λq ∑i: s(i)=u λspan(i). [sent-120, score-0.147]
58 This normalization is important when using substrings of different lengths as the index set U; otherwise short substrings easily get too much weight. [sent-122, score-0.43]
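Again a brute-force sketch, exponential and for illustration only, implementing the normalized sum above directly (the function name and lower-casing are assumptions):

    from itertools import combinations

    def gap_weighted_feature(text, u, lam):
        """Normalized sum of lam**span(i) over all occurrences s(i) = u."""
        s, q = text.lower(), len(u)
        total = sum(lam ** (i[-1] - i[0] + 1)            # lam**span(i)
                    for i in combinations(range(len(s)), q)
                    if ''.join(s[k] for k in i) == u)
        return total / lam ** q

    # e.g. gap_weighted_feature('The cat was chased by the fat dog', 'tea', 0.5)

An occurrence with no gaps has span q and hence weight 1 after normalization; every additional gapped position multiplies the weight by λ.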
59 In some applications the actual length of the gaps may not be important but the number of contiguous substrings that compose the gapped subsequence may be more relevant. [sent-123, score-0.971]
60 In the literature, two approaches for computing gapped substring kernels appear: a dynamic programming approach (Lodhi et al., 2002) and a trie-based approach. [sent-127, score-0.54]
61 We review the dynamic programming approach and a variant of the trie-based computation in Section 3, followed by a presentation of the new sparse dynamic programming approach. [sent-128, score-0.396]
62 We conclude this section by noting that treating text as strings of characters is not the only and not necessarily the best approach. [sent-130, score-0.32]
63 Using the text ’The cat was chased by the fat dog’ as the example, if words are used as the alphabet: • Substrings are word sequences, or phrases: ’cat was chased’. [sent-138, score-0.821]
64 • Gapped substrings will be phrases with some words skipped: ’cat ? chased’, where ’?’ marks a skipped word. [sent-139, score-0.32]
65 • Penalizing gaps will decrease the weight of phrases that span too long a text segment. [sent-144, score-0.647]
66 A contiguous phrase such as ’the fat dog’ is weighted more highly than a gapped phrase ending in ’dog’, as the former exists in the text as such whereas the latter contains two gaps of total length four. [sent-149, score-0.469]
67 Using phrases as features has a potential advantage over the bag-of-words representation, as the ordering and the proximity of the words are taken into account. [sent-150, score-0.16]
68 There is, however, one drawback in using words or phrases as the features, namely the slight variations in word forms that still correspond to the same meaning. [sent-152, score-0.306]
69 Such variations include alternative spellings, prefixes/suffixes attached to words or word stems. [sent-153, score-0.119]
70 An alternative approach is the use of a syllable alphabet: the text is treated as a sequence of syllables: ’The cat was cha sed by the fat dog’. [sent-155, score-0.659]
71 Compared to the character alphabet, word and syllable alphabets share two benefits. [sent-160, score-0.205]
72 Firstly, as argued above, phrases of words or syllables are more likely to capture meaning in the text than arbitrary substrings of characters. [sent-161, score-0.396]
73 Secondly, as the document size drastically goes down when moving from character to syllable and word alphabets, computational requirements decrease as well. [sent-162, score-0.138]
74 The first is based on constructing an implicit trie data structure for the co-occurring gapped substrings in s and t. [sent-169, score-0.525]
75 The second algorithm is the dynamic programming approach by Lodhi et al. [sent-170, score-0.198]
76 (2002), and the third is a new method based on sparse dynamic programming. [sent-171, score-0.117]
77 As a running example we use two strings s = ’The cat was chased by the fat dog’ and t = ’The fat cat bit the dog’ and, for illustration, we apply the word alphabet. [sent-172, score-1.397]
78 The search composes an implicit trie structure of matching subsequences in the two strings: each path from the root to a node corresponds to a subsequence that co-occurs in the two strings, in one or more locations, with the number of gaps at most a given integer gmax. [sent-176, score-0.909]
79 The number of gaps needs to be restricted in order to keep the computation efficient. [sent-177, score-0.394]
80 Thus the resulting kernel can be considered an approximation of (3) in which co-occurrences with more than gmax gaps are discarded. [sent-178, score-0.547]
81 Figure 1 shows the trie structure of co-occurring subsequences for the example strings when the index set is fixed to U = Σ3, that is, three-word phrases. [sent-179, score-0.462]
82 Below we briefly describe a trie-based algorithm that is a slight variation of the one described by Shawe-Taylor and Cristianini (2004) and also bears a resemblance to the related mismatch string kernel algorithm by Leslie et al. [sent-180, score-0.203]
83 In a trie-node corresponding to subsequence u = u1 · · · uq the algorithm keeps track of all matches s(i) = u that could still be extended further. [sent-182, score-0.286]
84 For each match i = (i1, . . . , iq), the last index iq is stored in a list of alive matches As(u, g), where g = span(i) − q is the number of gaps in the match. [sent-186, score-0.665]
85 To expand a node u, the algorithm looks for all possible letters c by which the matching subsequences can be extended to a longer match uc. [sent-188, score-0.224]
86 For an alive match i ∈ As(u, g) and all g′ ≤ gmax − g, the algorithm puts the index i + 1 + g′ into As(u si+1+g′, g + g′). [sent-189, score-0.216]
87 The search is continued for nodes uc for which both ∪g As(uc, g) and ∪g At(uc, g) are non-empty, that is, there is at least one occurrence of uc, with some number of gaps, in both s and t. [sent-191, score-0.209]
88 This makes the trie much sparser than the subsequence tries for either one of the strings alone would be. [sent-192, score-0.549]
89 To find the alive indices for ’the dog’, the algorithm searches for an occurrence of ’dog’ in s from indices 2 and 7 onwards, and finds the occurrence in position 6, corresponding to occurrences with 1 and 4 gaps, respectively. [sent-194, score-0.485]
90 When a node u at depth d is encountered, one easily obtains the relevant terms of the kernel by computing the sum φu(s)φu(t) = ∑gs,gt λgs+d |As(u, gs)| · λgt+d |At(u, gt)|. [sent-196, score-0.115]
91 If a length-p subsequence kernel is being computed, this computation only needs to be performed in the leaves of the trie. [sent-197, score-0.224]
92 For the bounded-length subsequence kernel, the computation needs to be performed in all trie nodes at depth q ≤ d ≤ p. [sent-198, score-0.347]
93 Note that the above approach differs from the (g, k)-gapped trie algorithm by Leslie and Kuang (2003) in two respects: First, the strings s and t are not broken into frames before the search but the algorithm maintains the lists As (u, g) to keep track of the subsequence occurrences. [sent-199, score-0.647]
94 Second, the algorithm keeps track of the number of gaps in the occurrences during the search. [sent-200, score-0.521]
95 This relieves us of the need to run a dynamic programming search in the trie leaves to compute the values φu(s)φu(t). [sent-201, score-0.402]
96 Note that if no gaps are allowed we get the time complexity O(p(|s| + |t|)), matching the suffix tree approach. [sent-203, score-0.45]
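To make the preceding description concrete, here is a compact sketch of a trie-style traversal for the fixed-length case. It is a simplified re-implementation written from the description above, not the authors' code: the dictionaries play the role of the alive-match lists As(u, g) and At(u, g) (gap count g mapped to end indices), expansion follows the i + 1 + g′ rule, and each depth-p node contributes λgs+p|As(u, gs)| · λgt+p|At(u, gt)|, summed over gs and gt, to the (unnormalized) kernel.

    def trie_kernel(s, t, p, lam, gmax):
        """Gap-weighted length-p kernel restricted to occurrences with at most
        gmax gaps, i.e. the approximation computed by the trie traversal."""

        def extensions(string, alive):
            # Extend every alive match by one more symbol, tracking gaps.
            children = {}                        # symbol c -> {gaps g: [end indices]}
            for g, ends in alive.items():
                for i in ends:
                    for extra in range(gmax - g + 1):
                        j = i + 1 + extra
                        if j < len(string):
                            bucket = children.setdefault(string[j], {})
                            bucket.setdefault(g + extra, []).append(j)
            return children

        def expand(As, At, depth):
            if depth == p:                       # leaf: add this subsequence's term
                phi_s = sum(lam ** (g + depth) * len(ix) for g, ix in As.items())
                phi_t = sum(lam ** (g + depth) * len(ix) for g, ix in At.items())
                return phi_s * phi_t
            cs, ct = extensions(s, As), extensions(t, At)
            return sum(expand(cs[c], ct[c], depth + 1) for c in cs.keys() & ct.keys())

        total = 0.0
        for c in set(s) & set(t):                # depth-1 nodes: single symbols
            As = {0: [i for i, ch in enumerate(s) if ch == c]}
            At = {0: [j for j, ch in enumerate(t) if ch == c]}
            total += expand(As, At, 1)
        return total

Because the code only indexes and compares symbols, it can be applied to the word alphabet of the running example by passing the two sentences as lists of words (e.g. via str.split()).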
97 3.2 Dynamic Programming Computation. The basis of the dynamic programming computation of the string kernel (3) is the following observation: if there is a co-occurrence of the substring u1 · · · [sent-205, score-0.508]
98 uq that ends at the k’th and l’th positions of s and t, respectively, two conditions must be satisfied: first, the final characters must match, sk = tl = uq, and second, there must be a co-occurrence of the prefix u1 · · · [sent-208, score-0.14]
99 uq−1 that ends in some position i < k in string s and some position j < l in string t. [sent-215, score-0.344]
100 Moreover, ignoring the normalization 1/λ2q, the co-occurrence weight can be computed from the weight of the prefix co-occurrence by extending the subsequences with k − i and l − j letters, respectively: λspan([i1,...,iq−1,k]) = λk−i λspan([i1,...,iq−1]) and λspan([j1,...,jq−1,l]) = λl−j λspan([j1,...,jq−1]). [sent-216, score-0.183]
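A sketch of the resulting full dynamic programming computation for the fixed-length case, written from the two conditions above rather than copied from any paper (a standard O(p|s||t|)-time formulation in the spirit of Lodhi et al.; it is not the sparse algorithm that is this paper's contribution, and the 1/λ2q normalization is omitted): A[i][j] holds the total weight of length-l co-occurrences whose last matching positions are exactly i in s and j in t, and B[i][j] accumulates these weights over all earlier end positions, discounted by λ per skipped character.

    def gap_weighted_dp_kernel(s, t, p, lam):
        """Unnormalized gap-weighted length-p subsequences kernel, full DP."""
        n, m = len(s), len(t)
        # A[i][j]: total weight of length-l co-occurrences ending exactly at
        # positions i of s and j of t (1-based; row/column 0 are padding).
        A = [[lam ** 2 if i and j and s[i - 1] == t[j - 1] else 0.0
              for j in range(m + 1)] for i in range(n + 1)]
        for l in range(2, p + 1):
            # B[i][j] = sum over i' <= i, j' <= j of lam**((i-i')+(j-j')) * A[i'][j']
            B = [[0.0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    B[i][j] = (A[i][j] + lam * B[i - 1][j] + lam * B[i][j - 1]
                               - lam * lam * B[i - 1][j - 1])
            A_next = [[0.0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    if s[i - 1] == t[j - 1]:
                        # extend a shorter co-occurrence ending strictly before (i, j)
                        A_next[i][j] = lam * lam * B[i - 1][j - 1]
            A = A_next
        return sum(A[i][j] for i in range(1, n + 1) for j in range(1, m + 1))

On small inputs this agrees, up to the λ2p normalization, with summing the products of the brute-force gap-weighted features over all u ∈ Σp, and, when gmax is chosen large enough, with the trie-based sketch above.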
wordName wordTfidf (topN-words)
[('dog', 0.469), ('gaps', 0.394), ('cat', 0.273), ('strings', 0.221), ('chased', 0.199), ('subsequences', 0.183), ('fat', 0.182), ('substrings', 0.181), ('trie', 0.181), ('gapped', 0.163), ('subsequence', 0.147), ('substring', 0.134), ('string', 0.129), ('dynamic', 0.117), ('phrases', 0.108), ('rousu', 0.108), ('gmax', 0.106), ('jq', 0.106), ('occurrence', 0.102), ('programming', 0.081), ('leslie', 0.079), ('occurrences', 0.079), ('span', 0.076), ('alive', 0.072), ('hawe', 0.072), ('text', 0.069), ('word', 0.067), ('alphabets', 0.067), ('iq', 0.061), ('alphabet', 0.059), ('length', 0.056), ('matching', 0.056), ('apped', 0.054), ('cha', 0.054), ('juho', 0.054), ('kuang', 0.054), ('lphabets', 0.054), ('syllables', 0.054), ('ubstring', 0.054), ('kernel', 0.047), ('lists', 0.047), ('gap', 0.047), ('matches', 0.046), ('bow', 0.045), ('lodhi', 0.045), ('arge', 0.045), ('syllable', 0.045), ('uq', 0.045), ('english', 0.045), ('pre', 0.045), ('kernels', 0.045), ('feature', 0.045), ('uc', 0.042), ('positions', 0.04), ('indices', 0.038), ('lengths', 0.037), ('sequences', 0.037), ('cancedda', 0.036), ('downweight', 0.036), ('encompass', 0.036), ('sed', 0.036), ('tea', 0.036), ('cristianini', 0.035), ('spectrum', 0.035), ('ernels', 0.034), ('position', 0.031), ('words', 0.031), ('index', 0.031), ('xes', 0.03), ('contiguous', 0.03), ('penalizing', 0.03), ('southampton', 0.03), ('traversing', 0.03), ('characters', 0.03), ('gs', 0.03), ('article', 0.03), ('track', 0.028), ('mismatch', 0.027), ('salton', 0.027), ('categorization', 0.026), ('character', 0.026), ('tl', 0.024), ('firstly', 0.024), ('ends', 0.024), ('search', 0.023), ('representation', 0.023), ('secondly', 0.023), ('cu', 0.023), ('vishwanathan', 0.023), ('features', 0.021), ('kingdom', 0.021), ('letters', 0.021), ('variations', 0.021), ('count', 0.021), ('hyperplane', 0.02), ('keeps', 0.02), ('th', 0.019), ('gt', 0.019), ('depth', 0.019), ('letter', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
2 0.045295462 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
3 0.042792089 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
4 0.040100213 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.034669317 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
6 0.033309493 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
7 0.028809816 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
8 0.026689965 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
9 0.023913726 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
10 0.023828659 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
11 0.022755804 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata
12 0.022619721 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
13 0.022440024 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
14 0.021351436 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.020603588 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
16 0.019154686 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
17 0.018147925 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
18 0.015790384 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.015558506 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
20 0.014372871 36 jmlr-2005-Gaussian Processes for Ordinal Regression
topicId topicWeight
[(0, 0.102), (1, 0.055), (2, 0.079), (3, -0.006), (4, 0.064), (5, -0.019), (6, -0.028), (7, 0.003), (8, -0.036), (9, -0.03), (10, 0.003), (11, -0.062), (12, 0.122), (13, 0.2), (14, -0.145), (15, -0.117), (16, 0.031), (17, 0.046), (18, 0.034), (19, 0.318), (20, 0.291), (21, 0.117), (22, -0.155), (23, -0.078), (24, 0.115), (25, 0.101), (26, -0.054), (27, -0.078), (28, -0.202), (29, 0.316), (30, 0.387), (31, 0.081), (32, 0.102), (33, -0.129), (34, 0.21), (35, -0.171), (36, 0.107), (37, -0.109), (38, -0.137), (39, 0.009), (40, -0.189), (41, 0.024), (42, 0.08), (43, -0.152), (44, -0.132), (45, 0.036), (46, -0.112), (47, 0.055), (48, 0.175), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.97607738 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
2 0.1502298 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
3 0.12598647 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
4 0.11718532 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.098912954 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
6 0.092853457 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
7 0.092649795 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
8 0.092145249 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.091975346 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
10 0.089185894 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
11 0.085771725 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
12 0.070953891 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
13 0.069001734 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.067481115 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
15 0.064849906 41 jmlr-2005-Kernel Methods for Measuring Independence
16 0.062180631 48 jmlr-2005-Learning the Kernel Function via Regularization
17 0.059694517 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
18 0.056016747 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.051590901 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
20 0.051184088 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
topicId topicWeight
[(13, 0.03), (19, 0.023), (36, 0.029), (37, 0.022), (42, 0.027), (43, 0.032), (47, 0.024), (52, 0.084), (59, 0.016), (70, 0.021), (88, 0.042), (90, 0.012), (93, 0.537), (94, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.81232083 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
2 0.22526085 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
3 0.20212394 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.20020075 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
5 0.19124125 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: O(m3 ) Standard SVM training has time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
6 0.19049461 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
7 0.18839385 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
8 0.18809703 36 jmlr-2005-Gaussian Processes for Ordinal Regression
9 0.1874316 12 jmlr-2005-An MDP-Based Recommender System
10 0.18712591 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
11 0.18532287 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
12 0.18349497 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
13 0.18160623 71 jmlr-2005-Variational Message Passing
14 0.18001005 3 jmlr-2005-A Classification Framework for Anomaly Detection
15 0.17923254 39 jmlr-2005-Information Bottleneck for Gaussian Variables
16 0.17914677 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
17 0.17744187 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.17718245 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.17659755 54 jmlr-2005-Managing Diversity in Regression Ensembles
20 0.17652234 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching