jmlr jmlr2005 jmlr2005-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. [sent-9, score-1.113]
2 The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. [sent-10, score-0.228]
3 Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. [sent-11, score-0.478]
4 On large alphabets, the new sparse dynamic programming algorithm is the most efficient. [sent-12, score-0.198]
5 On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. [sent-13, score-0.461]
6 Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming [sent-14, score-0.443]
7 1. Introduction. Machine learning algorithms working on sequence data are needed in bioinformatics as well as in text categorization and mining. [sent-15, score-0.095]
8 In contrast, standard machine learning algorithms work on a feature vector representation, thus requiring a feature extraction phase to map sequence data into feature vectors. [sent-16, score-0.135]
9 Kernel methods (Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) provide an efficient way of tackling the problem of dimensionality via the use of a kernel function, corresponding to the inner product of two feature vectors. [sent-18, score-0.092]
10 The family of kernel functions defined on feature vectors computed from strings is called string kernels (Watkins, 2000; Haussler, 1999). [sent-22, score-0.266]
11 These kernels are based on features corresponding to occurrences of certain kinds of subsequences in the string. [sent-23, score-0.328]
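Viewed computationally, every kernel discussed below reduces to an inner product of sparse feature vectors indexed by subsequences. A minimal sketch of that inner product over Python dictionaries (an illustrative helper, not code from the paper; the name kernel and the dict-based representation are assumptions):

    def kernel(phi_s, phi_t):
        """Inner product of two sparse feature maps (dict: subsequence -> feature value)."""
        if len(phi_s) > len(phi_t):            # iterate over the smaller map
            phi_s, phi_t = phi_t, phi_s
        return sum(v * phi_t.get(u, 0.0) for u, v in phi_s.items())

The feature-extraction sketches further below return (or can be wrapped to return) such dictionaries, so they can be plugged into this helper directly.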
12 There is a wide variety of string kernels depending on how the subsequences are defined: they can be contiguous or non-contiguous, they [sent-24, score-0.387]
13 can have bounded or unbounded length, and the mismatches or gaps can be penalized in different ways. [sent-25, score-0.394]
14 There are three main approaches to computing string kernels efficiently. [sent-26, score-0.174]
15 The dynamic programming approaches (Lodhi et al., 2002; Cancedda et al., 2003) are based on composing the solution from simpler subproblems, in this case from kernel values of shorter subsequences and prefixes of the two strings. [sent-29, score-0.23]
16 These approaches usually have time complexity of order Ω(p|s||t|) since one typically needs to compute intermediate results for each character pair (si, tj) and for each subsequence length 1 ≤ l ≤ p. [sent-30, score-0.229]
17 In the trie-based approaches (Leslie et al., 2003; Leslie and Kuang, 2003) one makes a depth-first traversal of an implicit trie data structure. [sent-33, score-0.181]
18 The search continues along each trie path as long as both strings contain an occurrence of the p-gram corresponding to the trie node. [sent-34, score-0.708]
19 This termination condition prunes the search space very efficiently if the number of gaps is restricted enough. [sent-35, score-0.417]
20 The third approach is to build a suffix tree of one of the strings and then compute matching statistics of the other string by traversing the suffix tree (Vishwanathan and Smola, 2003). [sent-36, score-0.492]
21 However, the approach does not deal with gapped strings. [sent-38, score-0.163]
22 In this paper, we concentrate on improving the time-efficiency of the dynamic programming approach to gapped string kernel computation. [sent-39, score-0.537]
23 In Section 2 we review types of kernels that are used in text categorization and sequence analysis tasks. [sent-40, score-0.14]
24 In Section 3, we review the trie-based and dynamic programming approaches for gap-weighted string kernel computation before presenting the main contribution of this article, a sparse dynamic programming algorithm for efficiently computing the kernel on large alphabets. [sent-42, score-0.619]
25 In Section 4 the new algorithm is experimentally compared against the full dynamic programming approach and a trie-based algorithm. [sent-44, score-0.198]
26 For sequences the most common feature representation is to count or check the existence of subsequence occurrences, where the subsequences are taken from a fixed index set U. [sent-57, score-0.487]
27 Different choices of the index set and of how occurrences are counted give rise to a family of feature representations and kernels. [sent-58, score-0.155]
28 Below we review the main forms of representation for sequences and the computation of kernels for such representations. [sent-59, score-0.105]
29 In the most widely used feature representation for strings in a natural language, informally called bag-of-words (BoW), the index set is taken as the set of words in the language, possibly excluding some frequently occurring stop words (Salton et al.). [sent-61, score-0.382]
30 In the case of a string s containing English text, for each English word u, we define the feature value φu(s) = |{ j | sj · · · sj+|u|−1 = u }|, the number of occurrences of the word u in s. [sent-64, score-0.241]
31 For the example text s = ’The cat was chased by the fat dog’ the BoW will contain the following non-zero entries: φthe(s) = 2, φdog(s) = 1, φwas(s) = 1, φchased(s) = 1, φby(s) = 1, φfat(s) = 1, φcat(s) = 1. [sent-68, score-0.723]
32 These occurrence counts can also be weighted, for example by scaling each count by the inverse document frequency of the corresponding word (TFIDF, Salton & Young, 1973). [sent-69, score-0.102]
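A minimal sketch of the BoW map in Python (an illustrative helper, not the authors' code; whitespace tokenization and lower-casing are simplifying assumptions):

    from collections import Counter

    def bow_features(text, stop_words=()):
        """Count word occurrences, dropping any words listed in stop_words."""
        words = text.lower().split()          # crude whitespace tokenization
        return Counter(w for w in words if w not in stop_words)

    phi = bow_features('The cat was chased by the fat dog')
    # Counter({'the': 2, 'cat': 1, 'was': 1, 'chased': 1, 'by': 1, 'fat': 1, 'dog': 1})

Combined with the inner-product helper sketched earlier, kernel(bow_features(s), bow_features(t)) gives a basic BoW kernel; a TFIDF variant would rescale each count before taking the product.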
33 For strings that do not have a crisply defined word structure, for example biological sequences, a different approach is more suitable. [sent-75, score-0.257]
34 Given an alphabet Σ, a simple choice is to take U = Σp, the set of strings of length p. [sent-76, score-0.336]
35 For example, if we choose p = 4, the resulting feature values for our example text include φ’the ’ = 2, φ’ the’ = 1, φ’ cat’ = 1, φ’ dog’ = 1 (the spaces are part of the 4-grams), along with close to thirty additional 4-grams. [sent-78, score-0.856]
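A corresponding sketch for the p-spectrum features, counting contiguous character p-grams (again illustrative; lower-casing and treating spaces as ordinary characters are assumptions matching the example above):

    from collections import Counter

    def p_spectrum(text, p):
        """Count all contiguous substrings (p-grams) of length p."""
        s = text.lower()
        return Counter(s[i:i + p] for i in range(len(s) - p + 1))

    phi = p_spectrum('The cat was chased by the fat dog', 4)
    # e.g. phi['the '] == 2 and phi[' dog'] == 1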
36 There is a two-fold difficulty in focusing on fixed-length subsequences: firstly, one may not know how to best choose the length p. [sent-79, score-0.112]
37 Secondly, there may be important subsequences of different lengths in the sequences. [sent-80, score-0.22]
38 This problem can be circumvented by allowing the lengths of the subsequences to vary within a range: U = Σq ∪ Σq+1 ∪ · · · ∪ Σp for some 1 ≤ q ≤ p. [sent-81, score-0.22]
39 We call the resulting kernel the bounded-length substring kernel. [sent-82, score-0.181]
40 In the extreme case, we can take q = 1 and p = ∞, thus including in the index set all non-empty sequences over the alphabet Σ. [sent-84, score-0.127]
41 Secondly, if all important subsequences have length at least some q0 , setting q < q0 will probably make the spectrum more ’noisy’. [sent-86, score-0.274]
42 The most efficient approaches to computing substring spectrum kernels, working in O(p(|s| + |t|)) time, are based on suffix trees (Leslie et al., [sent-89, score-0.214]
43 2002; Vishwanathan and Smola, 2003), although dynamic programming and trie-based approaches can also be used. [sent-90, score-0.198]
44 Another way to add flexibility to our feature representation is to allow gaps in the subsequence occurrences. [sent-92, score-0.609]
45 We denote an index sequence by j = (j1, . . . , jq), with j1 < j2 < · · · < jq, and write s(j) = sj1 · · · sjq for the corresponding subsequence. [sent-97, score-0.212]
46 We define the feature φu(s) to count the number of unique index sequences j for which the corresponding subsequence sj1 · · · [sent-101, score-0.264]
47 sjq equals u, in other words φu(s) = |{j | s(j) = u}|. [sent-104, score-0.137]
48 Our example string ’The cat was chased by the fat dog’ can be seen to contain, among others, the following gapped substrings of length 3: φtea = 7, φted = 5, φdog = 2. [sent-105, score-1.183]
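The definition can be checked with a brute-force sketch that enumerates all increasing index sequences; it is exponential in the subsequence length and serves only to illustrate φu(s) = |{j | s(j) = u}|, not as a practical algorithm (lower-casing is an assumption, and whether the resulting counts match the figures quoted above depends on exactly how case and spaces are treated):

    from itertools import combinations

    def gapped_count(text, u):
        """Number of increasing index sequences j with s(j) = u."""
        s = text.lower()
        return sum(1 for j in combinations(range(len(s)), len(u))
                   if ''.join(s[k] for k in j) == u)

    phi_tea = gapped_count('The cat was chased by the fat dog', 'tea')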
49 This definition does not distinguish between occurrences that contain few gaps and those that contain several, nor does it take the lengths of the gaps into account; all occurrences contribute to the feature value equally. [sent-106, score-0.555]
50 For example, the substring ’tea’ will have a high weight as it occurs many times in the text, although it never occurs with fewer than two gaps and the total length of gaps is at least three. [sent-107, score-0.978]
51 A solution for this problem is to downweight occurrences that contain many or long gaps. [sent-109, score-0.115]
52 Such a feature representation is the basis of gap-weighted string kernels. [sent-110, score-0.197]
53 In string matching, there are many approaches to weighting gaps. [sent-111, score-0.523]
54 We consider two gap-weighting schemes, both of which downweight occurrences exponentially with increasing gap number or length. [sent-115, score-0.162]
55 When downweighting by the total length of gaps, the weight of an occurrence i = (i1, . . . [sent-116, score-0.552]
56 , iq) with span span(i) = iq − i1 + 1 is defined as λspan(i), where 0 < λ ≤ 1 is a fixed penalty constant. [sent-119, score-0.198]
57 The feature values are then defined as a normalized sum of occurrence weights, φu(s) = 1/λq ∑i: s(i)=u λspan(i). [sent-120, score-0.147]
58 This normalization is important when using substrings of different lengths as the index set U; otherwise short substrings easily get too much weight. [sent-122, score-0.43]
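Again a brute-force sketch, exponential and for illustration only, implementing the normalized sum above directly (the function name and lower-casing are assumptions):

    from itertools import combinations

    def gap_weighted_feature(text, u, lam):
        """Normalized sum of lam**span(i) over all occurrences s(i) = u."""
        s, q = text.lower(), len(u)
        total = sum(lam ** (i[-1] - i[0] + 1)            # lam**span(i)
                    for i in combinations(range(len(s)), q)
                    if ''.join(s[k] for k in i) == u)
        return total / lam ** q

    # e.g. gap_weighted_feature('The cat was chased by the fat dog', 'tea', 0.5)

An occurrence with no gaps has span q and hence weight 1 after normalization; every additional gapped position multiplies the weight by λ.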
59 In some applications the actual length of the gaps may not be important but the number of contiguous substrings that compose the gapped subsequence may be more relevant. [sent-123, score-0.971]
60 In the literature, two approaches for computing gapped substring kernels appear: a dynamic programming approach (Lodhi et al., 2002) and a trie-based approach. [sent-127, score-0.54]
61 We review the dynamic programming approach and a variant of the trie-based computation in Section 3, followed by a presentation of the new sparse dynamic programming approach. [sent-128, score-0.396]
62 We conclude this section by noting that treating text as strings of characters is not the only and not necessarily the best approach. [sent-130, score-0.32]
63 Using the text ’The cat was chased by the fat dog’ as the example, if words are used as the alphabet: • Substrings are word sequences, or phrases: ’cat was chased’. [sent-138, score-0.821]
64 • Gapped substrings will be phrases with some words skipped: ’cat ? chased’, where ’?’ marks a skipped word. [sent-139, score-0.32]
65 • Penalizing gaps will decrease the weight of phrases that span too long a text segment. [sent-144, score-0.647]
66 A contiguous phrase such as ’the fat dog’ is weighted more highly than a gapped phrase ending in ’dog’, as the former exists in the text as such whereas the latter contains two gaps of total length four. [sent-149, score-0.469]
67 Using phrases as features has a potential advantage over the bag-of-words representation, as the ordering and the proximity of the words are taken into account. [sent-150, score-0.16]
68 There is, however, one drawback in using words or phrases as the features, namely the slight variations in word forms that still correspond to the same meaning. [sent-152, score-0.306]
69 Such variations include alternative spellings, prefixes/suffixes attached to words or word stems. [sent-153, score-0.119]
70 An alternative approach is the use of a syllable alphabet: the text is treated as a sequence of syllables: ’The cat was cha sed by the fat dog’. [sent-155, score-0.659]
71 Compared to the character alphabet, word and syllable alphabets share two benefits. [sent-160, score-0.205]
72 Firstly, as argued above, phrases of words or syllables are more likely to capture meaning in the text than arbitrary substrings of characters. [sent-161, score-0.396]
73 Secondly, as the document size drastically goes down when moving from character to syllable and word alphabets, computational requirements decrease as well. [sent-162, score-0.138]
74 The first is based on constructing an implicit trie data structure for the co-occurring gapped substrings in s and t. [sent-169, score-0.525]
75 The second algorithm is the dynamic programming approach by Lodhi et al. [sent-170, score-0.198]
76 (2002), and the third is a new method based on sparse dynamic programming. [sent-171, score-0.117]
77 As a running example we use two strings s = ’The cat was chased by the fat dog’ and t = ’The fat cat bit the dog’ and, for illustration, we apply the word alphabet. [sent-172, score-1.397]
78 The search composes an implicit trie structure of matching subsequences in the two strings: each path from the root to a node corresponds to a subsequence that co-occurs in the two strings, in one or more locations, with the number of gaps at most a given integer gmax. [sent-176, score-0.909]
79 The number of gaps needs to be restricted in order to keep the computation efficient. [sent-177, score-0.394]
80 Thus the resulting kernel can be considered an approximation of (3) in which co-occurrences with more than gmax gaps are discarded. [sent-178, score-0.547]
81 Figure 1 shows the trie structure of co-occurring subsequences for the example strings when the index set is fixed to U = Σ3, that is, three-word phrases. [sent-179, score-0.462]
82 Below we briefly describe a trie-based algorithm that is a slight variation of the one described by Shawe-Taylor and Cristianini (2004) and also bears a resemblance to the related mismatch string kernel algorithm by Leslie et al. [sent-180, score-0.203]
83 In a trie-node corresponding to subsequence u = u1 · · · uq the algorithm keeps track of all matches s(i) = u that could still be extended further. [sent-182, score-0.286]
84 For each match i = (i1, . . . , iq), the last index iq is stored in a list of alive matches As(u, g), where g = span(i) − q is the number of gaps in the match. [sent-186, score-0.665]
85 To expand a node u, the algorithm looks for all possible letters c by which the matching subsequences can be extended to a longer match uc. [sent-188, score-0.224]
86 For an alive match i ∈ As(u, g) and all g′ ≤ gmax − g, the algorithm puts the index i + 1 + g′ into As(u si+1+g′, g + g′). [sent-189, score-0.216]
87 The search is continued for nodes uc for which both ∪g As(uc, g) and ∪g At(uc, g) are non-empty, that is, there is at least one occurrence of uc, with some number of gaps, in both s and t. [sent-191, score-0.209]
88 This makes the trie much sparser than the subsequence tries for either one of the strings alone would be. [sent-192, score-0.549]
89 To find the alive indices for ’the dog’, the algorithm searches for an occurrence of ’dog’ in s from indices 2 and 7 onwards, and finds the occurrence in position 6, corresponding to occurrences with 1 and 4 gaps, respectively. [sent-194, score-0.485]
90 When a node u at depth d is encountered, one easily obtains the relevant terms of the kernel by computing the sum φu(s)φu(t) = ∑gs,gt λgs+d |As(u, gs)| · λgt+d |At(u, gt)|. [sent-196, score-0.115]
91 If a length-p subsequence kernel is being computed, this computation only needs to be performed in the leaves of the trie. [sent-197, score-0.224]
92 For the bounded-length subsequence kernel, the computation needs to be performed in all trie nodes at depth q ≤ d ≤ p. [sent-198, score-0.347]
93 Note that the above approach differs from the (g, k)-gapped trie algorithm by Leslie and Kuang (2003) in two respects: First, the strings s and t are not broken into frames before the search but the algorithm maintains the lists As (u, g) to keep track of the subsequence occurrences. [sent-199, score-0.647]
94 Second, the algorithm keeps track of the number of gaps in the occurrences during the search. [sent-200, score-0.521]
95 This relieves us of the need to run a dynamic programming search in the trie leaves to compute the values φu(s)φu(t). [sent-201, score-0.402]
96 Note that if no gaps are allowed we get the time complexity O(p(|s| + |t|)), matching the suffix tree approach. [sent-203, score-0.45]
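To make the preceding description concrete, here is a compact sketch of a trie-style traversal for the fixed-length case. It is a simplified re-implementation written from the description above, not the authors' code: the dictionaries play the role of the alive-match lists As(u, g) and At(u, g) (gap count g mapped to end indices), expansion follows the i + 1 + g′ rule, and each depth-p node contributes λgs+p|As(u, gs)| · λgt+p|At(u, gt)|, summed over gs and gt, to the (unnormalized) kernel.

    def trie_kernel(s, t, p, lam, gmax):
        """Gap-weighted length-p kernel restricted to occurrences with at most
        gmax gaps, i.e. the approximation computed by the trie traversal."""

        def extensions(string, alive):
            # Extend every alive match by one more symbol, tracking gaps.
            children = {}                        # symbol c -> {gaps g: [end indices]}
            for g, ends in alive.items():
                for i in ends:
                    for extra in range(gmax - g + 1):
                        j = i + 1 + extra
                        if j < len(string):
                            bucket = children.setdefault(string[j], {})
                            bucket.setdefault(g + extra, []).append(j)
            return children

        def expand(As, At, depth):
            if depth == p:                       # leaf: add this subsequence's term
                phi_s = sum(lam ** (g + depth) * len(ix) for g, ix in As.items())
                phi_t = sum(lam ** (g + depth) * len(ix) for g, ix in At.items())
                return phi_s * phi_t
            cs, ct = extensions(s, As), extensions(t, At)
            return sum(expand(cs[c], ct[c], depth + 1) for c in cs.keys() & ct.keys())

        total = 0.0
        for c in set(s) & set(t):                # depth-1 nodes: single symbols
            As = {0: [i for i, ch in enumerate(s) if ch == c]}
            At = {0: [j for j, ch in enumerate(t) if ch == c]}
            total += expand(As, At, 1)
        return total

Because the code only indexes and compares symbols, it can be applied to the word alphabet of the running example by passing the two sentences as lists of words (e.g. via str.split()).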
97 3.2 Dynamic Programming Computation. The basis of the dynamic programming computation of the string kernel (3) is the following observation: if there is a co-occurrence of the substring u1 · · · [sent-205, score-0.508]
98 uq that ends at the k’th and l’th positions of s and t, respectively, two conditions must be satisfied: first, the final characters must match, sk = tl = uq, and second, there must be a co-occurrence of the prefix u1 · · · [sent-208, score-0.14]
99 uq−1 that ends in some position i < k in string s and some position j < l in string t. [sent-215, score-0.344]
100 Moreover, ignoring the normalization 1/λ2q, the co-occurrence weight can be computed from the weight of the prefix co-occurrence by extending the subsequences with k − i and l − j letters, respectively: λspan([i1,...,iq−1,k]) = λk−i λspan([i1,...,iq−1]) and λspan([j1,...,jq−1,l]) = λl−j λspan([j1,...,jq−1]). [sent-216, score-0.183]
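A sketch of the resulting full dynamic programming computation for the fixed-length case, written from the two conditions above rather than copied from any paper (a standard O(p|s||t|)-time formulation in the spirit of Lodhi et al.; it is not the sparse algorithm that is this paper's contribution, and the 1/λ2q normalization is omitted): A[i][j] holds the total weight of length-l co-occurrences whose last matching positions are exactly i in s and j in t, and B[i][j] accumulates these weights over all earlier end positions, discounted by λ per skipped character.

    def gap_weighted_dp_kernel(s, t, p, lam):
        """Unnormalized gap-weighted length-p subsequences kernel, full DP."""
        n, m = len(s), len(t)
        # A[i][j]: total weight of length-l co-occurrences ending exactly at
        # positions i of s and j of t (1-based; row/column 0 are padding).
        A = [[lam ** 2 if i and j and s[i - 1] == t[j - 1] else 0.0
              for j in range(m + 1)] for i in range(n + 1)]
        for l in range(2, p + 1):
            # B[i][j] = sum over i' <= i, j' <= j of lam**((i-i')+(j-j')) * A[i'][j']
            B = [[0.0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    B[i][j] = (A[i][j] + lam * B[i - 1][j] + lam * B[i][j - 1]
                               - lam * lam * B[i - 1][j - 1])
            A_next = [[0.0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    if s[i - 1] == t[j - 1]:
                        # extend a shorter co-occurrence ending strictly before (i, j)
                        A_next[i][j] = lam * lam * B[i - 1][j - 1]
            A = A_next
        return sum(A[i][j] for i in range(1, n + 1) for j in range(1, m + 1))

On small inputs this agrees, up to the λ2p normalization, with summing the products of the brute-force gap-weighted features over all u ∈ Σp, and, when gmax is chosen large enough, with the trie-based sketch above.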
wordName wordTfidf (topN-words)
[('dog', 0.469), ('gaps', 0.394), ('cat', 0.273), ('strings', 0.221), ('chased', 0.199), ('subsequences', 0.183), ('fat', 0.182), ('substrings', 0.181), ('trie', 0.181), ('gapped', 0.163), ('subsequence', 0.147), ('substring', 0.134), ('string', 0.129), ('dynamic', 0.117), ('phrases', 0.108), ('rousu', 0.108), ('gmax', 0.106), ('jq', 0.106), ('occurrence', 0.102), ('programming', 0.081), ('leslie', 0.079), ('occurrences', 0.079), ('span', 0.076), ('alive', 0.072), ('hawe', 0.072), ('text', 0.069), ('word', 0.067), ('alphabets', 0.067), ('iq', 0.061), ('alphabet', 0.059), ('length', 0.056), ('matching', 0.056), ('apped', 0.054), ('cha', 0.054), ('juho', 0.054), ('kuang', 0.054), ('lphabets', 0.054), ('syllables', 0.054), ('ubstring', 0.054), ('kernel', 0.047), ('lists', 0.047), ('gap', 0.047), ('matches', 0.046), ('bow', 0.045), ('lodhi', 0.045), ('arge', 0.045), ('syllable', 0.045), ('uq', 0.045), ('english', 0.045), ('pre', 0.045), ('kernels', 0.045), ('feature', 0.045), ('uc', 0.042), ('positions', 0.04), ('indices', 0.038), ('lengths', 0.037), ('sequences', 0.037), ('cancedda', 0.036), ('downweight', 0.036), ('encompass', 0.036), ('sed', 0.036), ('tea', 0.036), ('cristianini', 0.035), ('spectrum', 0.035), ('ernels', 0.034), ('position', 0.031), ('words', 0.031), ('index', 0.031), ('xes', 0.03), ('contiguous', 0.03), ('penalizing', 0.03), ('southampton', 0.03), ('traversing', 0.03), ('characters', 0.03), ('gs', 0.03), ('article', 0.03), ('track', 0.028), ('mismatch', 0.027), ('salton', 0.027), ('categorization', 0.026), ('character', 0.026), ('tl', 0.024), ('firstly', 0.024), ('ends', 0.024), ('search', 0.023), ('representation', 0.023), ('secondly', 0.023), ('cu', 0.023), ('vishwanathan', 0.023), ('features', 0.021), ('kingdom', 0.021), ('letters', 0.021), ('variations', 0.021), ('count', 0.021), ('hyperplane', 0.02), ('keeps', 0.02), ('th', 0.019), ('gt', 0.019), ('depth', 0.019), ('letter', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
2 0.045295462 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
3 0.042792089 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
4 0.040100213 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.034669317 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
6 0.033309493 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
7 0.028809816 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
8 0.026689965 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
9 0.023913726 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
10 0.023828659 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
11 0.022755804 8 jmlr-2005-Active Coevolutionary Learning of Deterministic Finite Automata
12 0.022619721 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
13 0.022440024 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
14 0.021351436 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.020603588 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
16 0.019154686 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
17 0.018147925 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
18 0.015790384 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.015558506 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
20 0.014372871 36 jmlr-2005-Gaussian Processes for Ordinal Regression
topicId topicWeight
[(0, 0.102), (1, 0.055), (2, 0.079), (3, -0.006), (4, 0.064), (5, -0.019), (6, -0.028), (7, 0.003), (8, -0.036), (9, -0.03), (10, 0.003), (11, -0.062), (12, 0.122), (13, 0.2), (14, -0.145), (15, -0.117), (16, 0.031), (17, 0.046), (18, 0.034), (19, 0.318), (20, 0.291), (21, 0.117), (22, -0.155), (23, -0.078), (24, 0.115), (25, 0.101), (26, -0.054), (27, -0.078), (28, -0.202), (29, 0.316), (30, 0.387), (31, 0.081), (32, 0.102), (33, -0.129), (34, 0.21), (35, -0.171), (36, 0.107), (37, -0.109), (38, -0.137), (39, 0.009), (40, -0.189), (41, 0.024), (42, 0.08), (43, -0.152), (44, -0.132), (45, 0.036), (46, -0.112), (47, 0.055), (48, 0.175), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.97607738 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
2 0.1502298 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
Author: Rie Kubota Ando, Tong Zhang
Abstract: One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods are proposed, at the current stage, we still don’t have a complete understanding of their effectiveness. This paper investigates a closely related problem, which leads to a novel approach to semi-supervised learning. Specifically we consider learning predictive structures on hypothesis spaces (that is, what kind of classifiers have good predictive power) from multiple learning tasks. We present a general framework in which the structural learning problem can be formulated and analyzed theoretically, and relate it to learning with unlabeled data. Under this framework, algorithms for structural learning will be proposed, and computational issues will be investigated. Experiments will be given to demonstrate the effectiveness of the proposed algorithms in the semi-supervised learning setting.
3 0.12598647 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
4 0.11718532 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.098912954 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
6 0.092853457 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
7 0.092649795 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
8 0.092145249 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.091975346 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
10 0.089185894 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
11 0.085771725 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
12 0.070953891 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
13 0.069001734 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.067481115 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
15 0.064849906 41 jmlr-2005-Kernel Methods for Measuring Independence
16 0.062180631 48 jmlr-2005-Learning the Kernel Function via Regularization
17 0.059694517 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
18 0.056016747 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.051590901 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
20 0.051184088 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
topicId topicWeight
[(13, 0.03), (19, 0.023), (36, 0.029), (37, 0.022), (42, 0.027), (43, 0.032), (47, 0.024), (52, 0.084), (59, 0.016), (70, 0.021), (88, 0.042), (90, 0.012), (93, 0.537), (94, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.81232083 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
Author: Juho Rousu, John Shawe-Taylor
Abstract: We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i, j)|si = t j } is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted. Keywords: kernel methods, string kernels, text categorization, sparse dynamic programming
2 0.22526085 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
3 0.20212394 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.20020075 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
5 0.19124125 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
Author: Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Abstract: O(m3 ) Standard SVM training has time and O(m2 ) space complexities, where m is the training set size. It is thus computationally infeasible on very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB) problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine (CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and realworld data sets demonstrate that the CVM is as accurate as existing SVM implementations, but is much faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC. Keywords: kernel methods, approximation algorithm, minimum enclosing ball, core set, scalability
6 0.19049461 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
7 0.18839385 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
8 0.18809703 36 jmlr-2005-Gaussian Processes for Ordinal Regression
9 0.1874316 12 jmlr-2005-An MDP-Based Recommender System
10 0.18712591 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
11 0.18532287 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
12 0.18349497 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
13 0.18160623 71 jmlr-2005-Variational Message Passing
14 0.18001005 3 jmlr-2005-A Classification Framework for Anomaly Detection
15 0.17923254 39 jmlr-2005-Information Bottleneck for Gaussian Variables
16 0.17914677 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
17 0.17744187 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.17718245 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.17659755 54 jmlr-2005-Managing Diversity in Regression Ensembles
20 0.17652234 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching