jmlr jmlr2008 jmlr2008-55 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
Reference: text
sentIndex sentText sentNum sentScore
1 In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. [sent-9, score-0.55]
2 The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. [sent-10, score-0.551]
3 As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. [sent-11, score-0.333]
4 The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. [sent-13, score-1.006]
5 Its generality is manifested by the ability to handle a large number of kernel functions, distances and non-metric similarity measures. [sent-61, score-0.342]
6 From considerations of efficiency, we focus on algorithms with linear-time asymptotic complexity in the sequence lengths—at the expense of narrowing the scope of similarity measures that can be handled. [sent-62, score-0.33]
7 The basis of our framework is embedding of sequences in a high-dimensional feature space using a formal language, a classical tool of computer science for modeling semantics of sequences. [sent-65, score-0.446]
8 A further advantage of embedding using formal languages is separation of embedding models from algorithms, which allows one to investigate different data structures to obtain optimal efficiency in practice. [sent-68, score-0.582]
9 Several data structures have been previously considered for specific similarity measures, such as hash tables (Damashek, 1995), sorted arrays (Sonnenburg et al. [sent-69, score-0.763]
10 Most of them are also suitable for the general framework developed in this paper; however, certain trade-offs exist between their asymptotic run-time complexity, implementation difficulty and restrictions on embedding languages they can handle. [sent-75, score-0.338]
11 To provide an insight into these issues, we propose and analyze three data structures suitable for our framework: sorted arrays, tries and suffix trees with an extension to suffix arrays. [sent-76, score-0.333]
12 The message of our analysis, supported by experimental evaluation, is that the choice of an optimal data structure depends on the embedding language to be used. [sent-77, score-0.35]
13 Related Work Assessing the similarity of two sequences is a classical problem of computer science. [sent-83, score-0.417]
14 However, similarity measures based on the Hamming distance are restricted to sequences of equal length and measures derived from the Levenshtein distance (e. [sent-87, score-0.724]
15 The idea of determining similarity of sequences by an inner-product was revived in kernel-based learning in the form of bag-of-words kernels (e. [sent-99, score-0.486]
16 Moreover, research in bioinformatics and text processing advanced the capabilities of string kernels, for example, by considering gaps, mismatches and positions in sequences (e. [sent-107, score-0.403]
17 The comparison framework proposed in this article shares the concept of embedding sequences with all of the above kernels, in fact most of the linear-time string kernels (e. [sent-113, score-0.681]
18 A further alternative for comparison of sequences are kernels derived from generative probability models, such as the Fisher kernel (Jaakkola et al. [sent-117, score-0.374]
19 The approach enables the design of highly specific similarity measures which exploit the rich structure of generative models, for example, for prediction of DNA splice sites (Rätsch and Sonnenburg, 2004). [sent-121, score-0.327]
20 1 Embedding Sequences using a Formal Language The basis for embedding of a sequence x is a formal language L, whose elements are sequences spanning an |L|-dimensional feature space. [sent-135, score-0.594]
21 We refer to L as the embedding language and to a sequence w ∈ L as a word of L. [sent-136, score-0.489]
22 Note that the alphabet A in the embedding languages may also be composed of higher semantic constructs, such as natural words or syntactic tokens. [sent-161, score-0.452]
23 Given an embedding language L, a sequence x can be mapped into the |L|-dimensional feature space by calculating a function φw (x) for every w ∈ L appearing in x. [sent-163, score-0.392]
24 The embedding function for a sequence x is given by φ : x → (φw(x))w∈L with φw(x) := occ(w, x) · Ww (1), where occ(w, x) is the number of occurrences of w in the sequence x and Ww is a weighting assigned to individual words. [sent-164, score-0.377]
25 The weight Ww can, for example, be based on the length |w| (see Shawe-Taylor and Cristianini, 2004; Vishwanathan and Smola, 2004), so that longer words contribute more than shorter words to a similarity measure. [sent-177, score-0.368]
26 The introduced weighting schemes can be coupled to further refine the embedding based on L, for example, in text processing the impact of a particular term might be influenced by the term frequency, inverse document frequency and its length. [sent-187, score-0.334]
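To make the embedding concrete, the following Python sketch (an illustration written for this summary, not the authors' implementation; the function name embed_kgrams and the length-based weighting are choices made here) computes the sparse k-gram embedding phi_w(x) = occ(w, x) * Ww of expression (1).

from collections import Counter

def embed_kgrams(x, k, weight=lambda w: 1.0):
    """Map a sequence x to its sparse k-gram embedding {w: occ(w, x) * W_w}."""
    counts = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    return {w: occ * weight(w) for w, occ in counts.items()}

# Example: 3-grams of the running example x = "abbaa", here with the
# (illustrative) length-based weighting W_w = |w|.
phi_x = embed_kgrams("abbaa", k=3, weight=len)
# -> {'abb': 3.0, 'bba': 3.0, 'baa': 3.0}

For the unweighted case (weight left at its default), phi_x simply holds the occurrence counts occ(w, x) used throughout the running example.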
27 2 Vectorial Similarity Measures for Sequences With an embedding language L at hand, we can now express common vectorial similarity measures in the domain of sequences. [sent-189, score-0.638]
28 A further and rather exotic class of vectorial similarity measures is that of similarity coefficients (see Sokal and Sneath, 1963; Anderberg, 1973). [sent-193, score-0.503]
29 For application to non-binary vectors the summation variables a, b, c can be extended in terms of an embedding language L (Rieck et al. [sent-199, score-0.35]
30 The particular definitions of outer and inner functions for the presented similarity measures are given in Table 4. [sent-218, score-0.325]
31 For the Chebyshev distance the operator ⊗ represents the identity function, while for all other similarity measures it represents a multiplication. [sent-220, score-0.335]
32 As a last step towards the development of comparison algorithms, we need to address the high dimensionality of the feature space induced by the embedding language L. [sent-225, score-0.398]
33 We can now refine the unified form (3) by partitioning the similarity measures into conjunctive and disjunctive measures using an auxiliary function m̃: s(x, y) = ⊕w∈L m̃(φw(x), φw(y)). [sent-231, score-0.431]
34 By using a kernel to express similarity coefficients, as shown in expression (2), they also exhibit the conjunctive property. [sent-237, score-0.555]
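To illustrate how the unified form separates an inner function m from an outer aggregation, the hypothetical sketch below (not the paper's implementation; function names are chosen here) evaluates one conjunctive measure, the linear kernel, and one disjunctive measure, the Manhattan distance, over sparse embeddings stored as dictionaries.

def linear_kernel(px, py):
    # Conjunctive: only words occurring in *both* sequences contribute,
    # so it suffices to loop over the smaller embedding.
    small, large = (px, py) if len(px) <= len(py) else (py, px)
    return sum(v * large[w] for w, v in small.items() if w in large)

def manhattan(px, py):
    # Disjunctive: every word occurring in *either* sequence contributes.
    return sum(abs(px.get(w, 0.0) - py.get(w, 0.0)) for w in px.keys() | py.keys())

phi_x = {"abb": 1, "bba": 1, "baa": 1}   # 3-grams of x = abbaa
phi_y = {"aaa": 2, "aab": 1, "baa": 1}   # 3-grams of y = baaaab
print(linear_kernel(phi_x, phi_y))   # 1: only "baa" is shared
print(manhattan(phi_x, phi_y))       # 5

The conjunctive/disjunctive distinction matters for efficiency: a conjunctive measure never has to touch words that appear in only one of the two sequences.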
35 In particular, we present three approaches differing in capabilities and implementation complexity covering simple sorted arrays, tries and generalized suffix trees. [sent-247, score-0.332]
36 As an example running through this section we consider the two sequences x = abbaa and y = baaaab from the binary alphabet A = {a, b} and the embedding language of 3-grams, L = A^3. [sent-250, score-0.664]
37 A simple and intuitive representation for storage of embedded sequences is provided by sorted arrays or, alternatively, sorted lists (Joachims, 2002; Rieck et al. [sent-254, score-0.933]
38 Given an embedding language L and a sequence x, all words w ∈ L satisfying w ⊑ x are maintained in an array X along with their embedding values φw (x). [sent-257, score-0.837]
39 Each field x of X consists of two attributes: the stored word word[x] and its embedding value phi[x]. [sent-258, score-0.341]
40 Figure 1 illustrates the sorted arrays of 3-grams extracted from the two example sequences x and y. [sent-261, score-0.709]
41 Figure 1: Sorted arrays of 3-grams for x = abbaa and y = baaaab, with fields word[x] | phi[x]: X = (abb | 1, baa | 1, bba | 1) and Y = (aaa | 2, aab | 1, baa | 1). [sent-262, score-0.463]
42 Comparison of two sorted arrays X and Y is carried out by looping over the fields of both arrays in the manner of merging sorted arrays (Knuth, 1973). [sent-265, score-1.354]
43 The comparison algorithm based on sorted arrays is simple to implement, yet it does not enable linear-time comparison for all embedding languages, for example, if L = A∗ . [sent-273, score-0.847]
44 However, sorted arrays enable linear-time similarity measures, if there exist O(|x|) words with w ⊑ x, which implies all w ∈ L have no or constant overlap in x. [sent-274, score-0.765]
45 Under these constraints a sorted array is extracted from a sequence x in O(k|x|) time using linear-time sorting, for example, radix sort (Knuth, 1973), where k is the maximum length of all words w ∈ L in x. [sent-276, score-0.477]
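A minimal sketch of this scheme (assuming unweighted k-gram counts; the function names and the match/miss parametrization are illustrative simplifications, not the paper's Algorithm 1): extraction yields a sorted array of (word, phi) pairs, and comparison is a single merge-style pass over both arrays.

def sorted_array(x, k):
    counts = {}
    for i in range(len(x) - k + 1):
        w = x[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return sorted(counts.items())           # list of (word, phi), sorted by word

def compare(X, Y, match, miss):
    """Merge both arrays; 'match' handles shared words, 'miss' unmatched ones."""
    s, i, j = 0.0, 0, 0
    while i < len(X) and j < len(Y):
        if X[i][0] == Y[j][0]:
            s += match(X[i][1], Y[j][1]); i += 1; j += 1
        elif X[i][0] < Y[j][0]:
            s += miss(X[i][1]); i += 1
        else:
            s += miss(Y[j][1]); j += 1
    s += sum(miss(v) for _, v in X[i:]) + sum(miss(v) for _, v in Y[j:])
    return s

X, Y = sorted_array("abbaa", 3), sorted_array("baaaab", 3)
kernel = compare(X, Y, match=lambda a, b: a * b, miss=lambda a: 0.0)     # 1.0
manhattan = compare(X, Y, match=lambda a, b: abs(a - b), miss=abs)      # 5.0

Each comparison visits every field of X and Y at most once, which reflects the linear-time behaviour discussed above.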
46 The simple design of the sorted array approach enables a very efficient extension. [sent-280, score-0.325]
47 Another extension for computation of conjunctive measures using sorted arrays has been proposed by Sonnenburg et al. [sent-287, score-0.65]
48 If two sequences x and y have unbalanced sizes |x| ≪ |y|, one loops over the shorter sorted array X and performs a binary search procedure on Y, instead of processing both sorted arrays in parallel. [sent-289, score-1.034]
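A possible sketch of this variant (illustrative only; the actual procedure of Sonnenburg et al. may differ in details): for a conjunctive measure it suffices to look up each word of the shorter array in the longer one by binary search, giving roughly O(|X| log |Y|) field visits instead of O(|X| + |Y|).

from bisect import bisect_left

def conjunctive_unbalanced(X, Y, match):
    if len(X) > len(Y):
        X, Y = Y, X
    words = [w for w, _ in Y]        # sorted word column of the larger array
    s = 0.0
    for w, v in X:
        j = bisect_left(words, w)
        if j < len(words) and words[j] == w:
            s += match(v, Y[j][1])
    return s

# e.g. conjunctive_unbalanced(X, Y, match=lambda a, b: a * b) reproduces the
# linear kernel value computed by the merge-based comparison above.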
49 A trie node x contains a vector of size |A| linking to child nodes, a binary flag mark[x] to indicate the end of a sequence, and an embedding value phi[x]. [sent-296, score-0.563]
50 A sequence x is embedded using a trie X by storing all w ∈ L with w ⊑ x and the corresponding φw(x) in X (Shawe-Taylor and Cristianini, 2004). [sent-299, score-0.434]
51 Figure 2 shows tries of 3-grams for the two example sequences x and y. [sent-300, score-0.328]
52 Note that for the embedding language of k-grams considered in Figure 2 all marked nodes are leaves, while for other embedding languages they may correspond to inner nodes, for example, in the case of blended k-grams, where every node in a trie marks the end of a sequence. [sent-301, score-1.048]
53 For convenience, we assume that child nodes are maintained in a vector, while in practice sorted lists, balanced trees or hash tables may be preferred. [sent-303, score-0.35]
54 The trie-based approach enables linear-time similarity measures over a larger set of formal languages than sorted arrays. [sent-314, score-0.549]
55 For tries we require all w ∈ L with w ⊑ x to have either constant overlap in x or to be a prefix of another word, for example, as for the blended k-gram embedding languages. [sent-315, score-0.432]
56 To determine the run-time complexity on tries, we need to consider the following property: A trie storing n words of maximum length k has depth k and at most kn nodes. [sent-316, score-0.327]
57 Thus, a sequence x containing O(|x|) words of maximum length k is embedded using a trie in O(k|x|) run-time. [sent-317, score-0.384]
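The rough sketch below illustrates trie-based comparison for a conjunctive measure; it is a simplification of the paper's Algorithm 2 (class and function names are chosen here), and it uses a dictionary of children instead of an |A|-sized vector. Both tries are traversed in parallel and only shared children are descended into, which bounds the work by O(k min(|x|, |y|)).

class Node:
    def __init__(self):
        self.children = {}    # symbol -> Node
        self.mark = False     # end of a stored word w
        self.phi = 0.0        # embedding value phi_w(x) at marked nodes

def build_trie(x, k):
    root = Node()
    for i in range(len(x) - k + 1):
        node = root
        for c in x[i:i + k]:
            node = node.children.setdefault(c, Node())
        node.mark, node.phi = True, node.phi + 1.0    # unweighted occ(w, x)
    return root

def compare_tries(nx, ny, match):
    s = match(nx.phi, ny.phi) if nx.mark and ny.mark else 0.0
    for c, child_x in nx.children.items():
        child_y = ny.children.get(c)
        if child_y is not None:
            s += compare_tries(child_x, child_y, match)
    return s

tx, ty = build_trie("abbaa", 3), build_trie("baaaab", 3)
print(compare_tries(tx, ty, match=lambda a, b: a * b))   # 1.0, the linear kernel

Supporting disjunctive measures additionally requires handling unmatched subtrees, for example via the aggregation extension described next.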
58 The first extension for the trie data structure is aggregation of embedding values in nodes. [sent-322, score-0.419]
59 Instead of recursively descending at a mismatching node x, one uses ϕ[x] to retrieve the aggregation of all lower embedding values. [sent-326, score-0.328]
60 The extension allows disjunctive similarity measures to be computed as efficiently as conjunctive measures at a worst-case run-time of O(k min(|x|, |y|)). [sent-327, score-0.431]
61 A generalized suffix tree (GST) is a compact trie containing all suffixes of a set of sequences x1 , . [sent-338, score-0.469]
62 In the remaining part we focus on the case of two sequences x and y delimited by $1 and $2 , computation of similarity measures over a set of sequences being a straightforward extension. [sent-350, score-0.692]
63 At a node v the function takes length[v] and depth[v] of v as arguments to determine how much the node and its incoming edge contribute to the similarity measure, for example, for the embedding language of k-grams only nodes up to a path depth of k need to be considered. [sent-360, score-0.736]
64 The filter implements the embedding language L = A∗ . [sent-364, score-0.35]
65 The incoming edge to a node v contributes to a similarity measure by length[v], because exactly length[v] contiguous subsequences terminate on the edge to v. [sent-365, score-0.397]
66 Thus, a GST enables linear-time computation of similarity measures, even if a sequence x contains O(|x|^2) words and the embedding language corresponds to L = A∗. [sent-371, score-0.65]
67 Consequently, computation of similarity measures between sequences using a GST can be realized in time linear in the sequence lengths independent of the complexity of L. [sent-376, score-0.532]
68 A suffix array is an integer array enumerating the suffixes of a sequence z in lexicographical order. [sent-380, score-0.358]
69 Using a suffix array and an array of longest-common prefixes (LCP) for suffixes, we replace the traversal of the GST by looping over a generalized suffix array in linear time. [sent-388, score-0.552]
70 Application of suffix arrays reduces memory requirements by a factor of 4. [sent-389, score-0.34]
71 Experiments and Applications In order to evaluate the run-time performance of the proposed comparison algorithms in practice and to investigate the effect of different similarity measures on sequential data, we conducted experiments on real world sequences. [sent-394, score-0.377]
72 Table 5 (overview of data sets; columns report the number of sequences #, alphabet size |A| and average length |x|µ): Bioinformatics: C. Elegans DNA sequences, SPROT protein sequences; Text processing: Reuters news articles, Heise news articles, RFC text documents; Computer security: HIDS system call traces, NIDS connection payloads, Spam email bodies; sources include Sonnenburg et al. [sent-398, score-0.608]
73 The number of sequences in each set is denoted by #, the alphabet size by |A| and the average sequence length by |x|µ . [sent-408, score-0.382]
74 In particular, for Algorithm 1 we incorporated 64-bit variables to realize a sorted 64-bit array, for Algorithm 2 we implemented a compact trie representation and for Algorithm 3 we used generalized suffix arrays in place of suffix trees. [sent-414, score-0.774]
75 For each of these algorithms we conducted experiments using different embedding languages to assess the run-time on the data sets given in Table 5. [sent-415, score-0.338]
76 We applied the following experimental procedure and averaged run-time over 10 individual runs: 500 sequences are randomly drawn from a data set and a 500 × 500 matrix is computed for the Manhattan distance using a chosen embedding language. [sent-416, score-0.493]
77 In particular, we use the embedding language of word k-grams—covering the classic “bag of words” as word 1-grams—by using an alphabet of words instead of characters. [sent-425, score-0.658]
78 The sorted array approach significantly outperforms the other algorithms on all data sets, yet it can only be applied for k ≤ 2, as it is limited to 64 bits. [sent-429, score-0.325]
79 For small values of k, suffix arrays require more time per comparison than compact tries, while for k > 5 their performance is similar. [sent-430, score-0.494]
80 For this experiment we focus on the embedding language of k-grams, which is not limited to a particular type of sequences, so that experiments were conducted for all data sets in Table 5. [sent-435, score-0.35]
81 Moreover, the limitation of sorted arrays to 64 bits does not affect all data sets, so that for DNA all k-gram lengths can be computed. [sent-440, score-0.507]
82 The suffix array slightly outperforms the trie comparison for larger values of k, as its worst-case run-time is independent of the length of k-grams. [sent-441, score-0.448]
83 Figure 5: Run-time of sequence comparison over k-grams for different algorithms (x-axis: k-gram length, 1 to 19; one panel per data set, e.g., C. Elegans). [sent-446, score-0.384]
84 2 Applications We now demonstrate that the ability of our approach to compute diverse similarity measures is beneficial in real applications, especially in unsupervised learning scenarios. [sent-453, score-0.344]
85 We then compute dissimilarity matrices for the Euclidean, Manhattan and Jensen-Shannon distances using the bag-of-words embedding language as discussed in Section 3. [sent-459, score-0.422]
86 In order to provide a fair investigation of the impact of various similarity measures on detection of attacks, we generated a smaller private data set. [sent-477, score-0.379]
87 Figure 6: Clustering of Reuters articles using different similarity measures (bag-of-words); panel (c) shows the ratio of topic labels in clusters of embedded articles. [sent-498, score-0.588]
88 The poor detection performance of the latter two similarity measures emphasizes that choosing a discriminative similarity measure is crucial for achieving high accuracy in a particular application. [sent-502, score-0.594]
89 As only 10% of the sequences in the data set correspond to transcription start sites, we additionally apply the unsupervised outlier detection method Gamma (Harmeling et al. [sent-579, score-0.404]
90 Conclusions The framework for comparison of sequences proposed in this article provides means for efficient computation of a large variety of similarity measures, including kernels, distances and non-metric similarity coefficients. [sent-592, score-0.752]
91 The framework is based on embedding of sequences in a high-dimensional feature space using formal languages, such as k-grams, contiguous subsequences, etc. [sent-593, score-0.503]
92 In general, we have observed a consistent trade-off between run-time efficiency and complexity of embedding languages a particular data structure can handle. [sent-597, score-0.338]
93 Sorted arrays are the most efficient data structure; however, their applicability is limited to k-grams and bag-of-words models. [sent-598, score-0.34]
94 The optimal data structure for computation of similarity measures thus depends on the embedding language to be used in a particular application. [sent-600, score-0.638]
95 The proposed framework offers machine learning researchers an opportunity to use a large variety of similarity measures for applications that involve sequential data. [sent-601, score-0.329]
96 Although an optimal similarity measure—as is well known and has also been observed in our experiments—depends on the particular application, the technical means for seamless incorporation of various similarity measures can be of great utility in practical applications of machine learning. [sent-602, score-0.503]
97 Especially appealing is the possibility for efficient computation of non-Euclidean distances over embedded sequences, which extend applicable similarity measures for sequences beyond inner-products and kernel functions. [sent-603, score-0.674]
98 Linear-time longest-common-prefix computation in suffix arrays and its applications. [sent-758, score-0.34]
99 Efficient algorithms for similarity measures over sequential data: A look beyond kernels. [sent-937, score-0.329]
100 Computation of similarity measures for sequential data using generalized suffix trees. [sent-945, score-0.368]
wordName wordTfidf (topN-words)
[('arrays', 0.34), ('embedding', 0.244), ('similarity', 0.215), ('manhattan', 0.206), ('phi', 0.206), ('sequences', 0.202), ('trie', 0.175), ('sorted', 0.167), ('array', 0.158), ('rieck', 0.154), ('gst', 0.144), ('askov', 0.134), ('ieck', 0.134), ('tries', 0.126), ('equential', 0.123), ('suffix', 0.123), ('string', 0.118), ('ww', 0.113), ('kulczynski', 0.113), ('language', 0.106), ('easures', 0.105), ('imilarity', 0.105), ('grams', 0.105), ('articles', 0.102), ('word', 0.097), ('languages', 0.094), ('detection', 0.091), ('subsequences', 0.082), ('measures', 0.073), ('tss', 0.072), ('distances', 0.072), ('alphabet', 0.071), ('intrusion', 0.07), ('leslie', 0.07), ('conjunctive', 0.07), ('kernels', 0.069), ('sonnenburg', 0.067), ('gram', 0.067), ('vishwanathan', 0.067), ('length', 0.067), ('arts', 0.062), ('blended', 0.062), ('nil', 0.062), ('rfc', 0.062), ('child', 0.059), ('contiguous', 0.057), ('embedded', 0.057), ('unsupervised', 0.056), ('kernel', 0.055), ('attacks', 0.055), ('transcription', 0.055), ('bit', 0.054), ('compact', 0.053), ('coffee', 0.051), ('knuth', 0.051), ('laskov', 0.051), ('laub', 0.051), ('salton', 0.051), ('mark', 0.051), ('operations', 0.05), ('weighting', 0.049), ('euclidean', 0.049), ('comparison', 0.048), ('dna', 0.048), ('distance', 0.047), ('xes', 0.047), ('protein', 0.046), ('bytes', 0.044), ('sugar', 0.044), ('anomaly', 0.044), ('nodes', 0.043), ('mismatch', 0.043), ('words', 0.043), ('symbols', 0.043), ('node', 0.043), ('bioinformatics', 0.042), ('sequence', 0.042), ('depth', 0.042), ('pairwise', 0.042), ('suf', 0.042), ('abbaa', 0.041), ('baa', 0.041), ('hash', 0.041), ('ilter', 0.041), ('kasai', 0.041), ('mismatching', 0.041), ('occ', 0.041), ('ompare', 0.041), ('text', 0.041), ('sequential', 0.041), ('trees', 0.04), ('reuters', 0.04), ('tsch', 0.039), ('spam', 0.039), ('traversal', 0.039), ('topic', 0.039), ('generalized', 0.039), ('smola', 0.037), ('inner', 0.037), ('sch', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
2 0.095999487 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
3 0.082051151 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
Author: Leonor Becerra-Bonache, Colin de la Higuera, Jean-Christophe Janodet, Frédéric Tantini
Abstract: When facing the question of learning languages in realistic settings, one has to tackle several problems that do not admit simple solutions. On the one hand, languages are usually defined by complex grammatical mechanisms for which the learning results are predominantly negative, as the few algorithms are not really able to cope with noise. On the other hand, the learning settings themselves rely either on too simple information (text) or on unattainable one (query systems that do not exist in practice, nor can be simulated). We consider simple but sound classes of languages defined via the widely used edit distance: the balls of strings. We propose to learn them with the help of a new sort of queries, called the correction queries: when a string is submitted to the Oracle, either she accepts it if it belongs to the target language, or she proposes a correction, that is, a string of the language close to the query with respect to the edit distance. We show that even if the good balls are not learnable in Angluin’s M AT model, they can be learned from a polynomial number of correction queries. Moreover, experimental evidence simulating a human Expert shows that this algorithm is resistant to approximate answers. Keywords: grammatical inference, oracle learning, correction queries, edit distance, balls of strings
4 0.077973478 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
Author: Zach Jorgensen, Yan Zhou, Meador Inge
Abstract: Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of “good” words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain. Keywords: spam filtering, multiple instance learning, good word attack, adversarial learning
5 0.076881371 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
6 0.072345071 56 jmlr-2008-Magic Moments for Structured Output Prediction
7 0.056966465 57 jmlr-2008-Manifold Learning: The Price of Normalization
8 0.054621387 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
9 0.052176218 96 jmlr-2008-Visualizing Data using t-SNE
10 0.051937547 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
11 0.050021209 92 jmlr-2008-Universal Multi-Task Kernels
12 0.046183687 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
13 0.045778159 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
14 0.045604307 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
15 0.043731894 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
16 0.041225892 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
17 0.040328227 58 jmlr-2008-Max-margin Classification of Data with Absent Features
18 0.038905635 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
19 0.03861187 86 jmlr-2008-SimpleMKL
topicId topicWeight
[(0, 0.189), (1, -0.047), (2, 0.089), (3, -0.02), (4, -0.052), (5, -0.031), (6, 0.101), (7, 0.119), (8, 0.053), (9, 0.026), (10, -0.162), (11, 0.237), (12, 0.025), (13, 0.083), (14, 0.06), (15, 0.084), (16, -0.095), (17, 0.049), (18, -0.151), (19, -0.358), (20, -0.135), (21, 0.118), (22, -0.129), (23, -0.003), (24, -0.133), (25, 0.009), (26, 0.066), (27, 0.135), (28, 0.107), (29, -0.003), (30, -0.041), (31, 0.017), (32, -0.072), (33, 0.087), (34, -0.042), (35, -0.062), (36, 0.055), (37, 0.104), (38, -0.027), (39, 0.044), (40, -0.105), (41, 0.067), (42, -0.033), (43, 0.086), (44, -0.047), (45, 0.019), (46, 0.058), (47, -0.063), (48, -0.065), (49, -0.135)]
simIndex simValue paperId paperTitle
same-paper 1 0.95956975 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
2 0.4930037 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
Author: Zach Jorgensen, Yan Zhou, Meador Inge
Abstract: Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of “good” words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain. Keywords: spam filtering, multiple instance learning, good word attack, adversarial learning
3 0.49063784 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
Author: Leonor Becerra-Bonache, Colin de la Higuera, Jean-Christophe Janodet, Frédéric Tantini
Abstract: When facing the question of learning languages in realistic settings, one has to tackle several problems that do not admit simple solutions. On the one hand, languages are usually defined by complex grammatical mechanisms for which the learning results are predominantly negative, as the few algorithms are not really able to cope with noise. On the other hand, the learning settings themselves rely either on too simple information (text) or on unattainable one (query systems that do not exist in practice, nor can be simulated). We consider simple but sound classes of languages defined via the widely used edit distance: the balls of strings. We propose to learn them with the help of a new sort of queries, called the correction queries: when a string is submitted to the Oracle, either she accepts it if it belongs to the target language, or she proposes a correction, that is, a string of the language close to the query with respect to the edit distance. We show that even if the good balls are not learnable in Angluin’s M AT model, they can be learned from a polynomial number of correction queries. Moreover, experimental evidence simulating a human Expert shows that this algorithm is resistant to approximate answers. Keywords: grammatical inference, oracle learning, correction queries, edit distance, balls of strings
4 0.36556995 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
5 0.33215499 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
6 0.29971889 57 jmlr-2008-Manifold Learning: The Price of Normalization
7 0.27194053 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
8 0.26388165 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
9 0.25609958 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
10 0.25129351 96 jmlr-2008-Visualizing Data using t-SNE
11 0.23045711 85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
12 0.22382793 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
13 0.21721926 92 jmlr-2008-Universal Multi-Task Kernels
14 0.21365531 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
15 0.20094852 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
16 0.19653341 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
17 0.185817 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
18 0.17942677 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
19 0.17929278 86 jmlr-2008-SimpleMKL
20 0.16119 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
topicId topicWeight
[(0, 0.021), (5, 0.024), (31, 0.019), (40, 0.041), (51, 0.013), (54, 0.025), (58, 0.025), (63, 0.488), (66, 0.033), (76, 0.032), (78, 0.022), (88, 0.069), (92, 0.032), (94, 0.056), (99, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.83483934 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
2 0.22497116 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
3 0.22476585 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.22291169 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
5 0.22081062 86 jmlr-2008-SimpleMKL
Author: Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet
Abstract: Multiple kernel learning (MKL) aims at simultaneously learning a kernel and the associated predictor in supervised learning settings. For the support vector machine, an efficient and general multiple kernel learning algorithm, based on semi-infinite linear programming, has been recently proposed. This approach has opened new perspectives since it makes MKL tractable for large-scale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs numerous iterations for converging towards a reasonable solution. In this paper, we address the MKL problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations. Apart from learning the combination, we solve a standard SVM optimization problem, where the kernel is defined as a linear combination of multiple kernels. We propose an algorithm, named SimpleMKL, for solving this MKL problem and provide a new insight on MKL algorithms based on mixed-norm regularization by showing that the two approaches are equivalent. We show how SimpleMKL can be applied beyond binary classification, for problems like regression, clustering (one-class classification) or multiclass classification. Experimental results show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other MKL algorithms. Finally, we illustrate the usefulness of MKL for some regressors based on wavelet kernels and on some model selection problems related to multiclass classification problems. Keywords: multiple kernel learning, support vector machines, support vector regression, multiclass SVM, gradient descent ∗. Also at Heudiasyc, CNRS/Universit´ de Technologie de Compi` gne (UMR 6599), 60205 Compi` gne, France. e e e c 2008 Alain Rakotomamonjy, Francis R. Bach, St´ phane Canu and Yves Grandvalet. e R AKOTOMAMONJY, BACH , C ANU AND G RANDVALET
6 0.22045237 57 jmlr-2008-Manifold Learning: The Price of Normalization
7 0.21972971 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
8 0.21967819 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
9 0.21923994 9 jmlr-2008-Active Learning by Spherical Subdivision
10 0.21798596 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
11 0.21772575 96 jmlr-2008-Visualizing Data using t-SNE
13 0.21673492 56 jmlr-2008-Magic Moments for Structured Output Prediction
14 0.21645385 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
15 0.21571271 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
16 0.21469285 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
17 0.21398266 83 jmlr-2008-Robust Submodular Observation Selection
18 0.21319753 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
19 0.21287176 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
20 0.2123069 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers