jmlr jmlr2012 jmlr2012-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Konrad Rieck, Christian Wressnegger, Alexander Bikadorov
Abstract: Strings and sequences are ubiquitous in many areas of data analysis. However, only a few learning methods can be directly applied to this form of data. We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. Sally has been successfully applied for learning with natural language text, DNA sequences and monitored program behavior. Keywords: string embedding, bag-of-words models, learning with sequential data
Reference: text
sentIndex sentText sentNum sentScore
1 University of Göttingen, Goldschmidtstraße 7, 37077 Göttingen, Germany. Christian Wressnegger, WRESSNEGGER@IDALAB.DE. [sent-3, score-0.046]
2 idalab GmbH, Adalbertstraße 20, 10977 Berlin, Germany. Alexander Bikadorov, ABIKU@CS.TU-BERLIN.DE. [sent-4, score-0.046]
3 Technische Universität Berlin, Franklinstraße 28/29, 10587 Berlin, Germany. Editor: Antti Honkela. Abstract: Strings and sequences are ubiquitous in many areas of data analysis. [sent-6, score-0.046]
4 We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. [sent-8, score-0.958]
5 Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. [sent-9, score-0.877]
6 The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. [sent-10, score-0.832]
7 The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. [sent-11, score-0.213]
8 Sally has been successfully applied for learning with natural language text, DNA sequences and monitored program behavior. [sent-12, score-0.073]
9 Keywords: string embedding, bag-of-words models, learning with sequential data. [sent-13, score-0.347]
10 1. Introduction. Strings and sequences are a common representation of data, and many applications of machine learning center on analyzing strings, for example, discovering topics in natural language text, identifying genes in DNA, or filtering spam messages. [sent-14, score-0.073]
11 However, the vast majority of learning methods cannot be directly applied to string data, as these methods usually operate in vector spaces and require numerical vectors as input for learning. [sent-15, score-0.347]
12 While a large body of research has studied techniques for learning with strings—most notably work on mapping strings to vectors (see Salton et al. [sent-16, score-0.461]
13 , 1975; Damashek, 1995) and string kernels (see Sonnenburg et al. [sent-17, score-0.347]
14 In this article, we present Sally, a general-purpose tool for mapping a set of strings to a set of vectors. [sent-19, score-0.525]
15 This mapping is referred to as an embedding of strings and allows for applying a wide range of learning methods to string data. [sent-20, score-0.916]
16 Sally implements a generalized form of the bag-of-words model, where strings are mapped to a high-dimensional vector space that is spanned by a set of string features. [sent-21, score-0.856]
17 Different types of features are supported for this embedding, which range from words [sent-22, score-0.074]
18 delimited by whitespace characters to positional and sorted n-grams. [sent-23, score-0.19]
19 Sally proceeds by counting the occurrences of these features in each string and generating a sparse vector of frequency values. [sent-24, score-0.412]
20 The implementation of Sally builds on string algorithms with linear run-time and space complexity, which enables processing millions of strings and features. [sent-25, score-0.832]
21 Sally is not the only tool for learning with strings; some learning toolboxes also provide support for extracting features from strings or computing string kernels. [sent-26, score-0.914]
22 The respective implementations are often tightly coupled with the toolboxes and restricted to specific applications. [sent-27, score-0.035]
23 By contrast, Sally can be thought of as a “Swiss Army Knife” for embedding strings: Instead of targeting a single toolbox or application, Sally provides a generic link between string data and learning methods. [sent-28, score-0.501]
24 The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. [sent-29, score-0.213]
25 Sally has been successfully applied in diverse learning settings, including text categorization, intrusion detection, clustering of malicious software, and analysis of electricity consumption (cf. [sent-30, score-0.142]
26 2. Embedding of Strings. Sally implements a generalized form of the classic bag-of-words model (Salton et al. [sent-36, score-0.041]
27 A string is represented by a set of string features (“the bag”) and mapped to a vector space whose dimensions are associated with the occurrences of these features. [sent-38, score-0.788]
28 This association is created using a hash function, where the hash value of each feature defines its dimension. [sent-39, score-0.156]
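To make this association concrete, the following Python sketch mimics the hashing step. It is an illustration of the idea only, not Sally's actual code, and the choice of CRC32 as the hash function is an assumption.

    # Illustrative sketch of hashing-based dimension assignment (not Sally's code).
    # The hash value of a feature, taken modulo 2**k, defines its dimension.
    import zlib

    def feature_dimension(feature, k=24):
        # CRC32 stands in for the unspecified hash function used by the tool.
        return zlib.crc32(feature.encode("utf-8")) % (2 ** k)

    # All occurrences of the same feature land in the same dimension.
    assert feature_dimension("ngram") == feature_dimension("ngram")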
29 Moreover, the tool extends the original bag-of-words model by supporting features derived from string kernels, such as the spectrum kernel (Leslie et al. [sent-40, score-0.501]
30 2.1 String Features. Sally supports three basic types of string features for constructing a bag-of-words model. [sent-45, score-0.447]
31 These types are defined implicitly by specifying delimiters (Configuration: ngram_delim) and the number of consecutive bytes/words to consider (Configuration: ngram_len). [sent-46, score-0.483]
32 Words: The strings are partitioned into substrings using a set of delimiter characters D. [sent-49, score-0.482]
33 Such partitioning is typical for natural language processing, where the delimiters are usually defined as whitespace and punctuation characters (Configuration: ngram_delim = D; and ngram_len = 1;). [sent-50, score-0.744]
34 Byte n-grams: The strings are characterized by overlapping byte sequences of length n. [sent-53, score-0.593]
35 These features are frequently used when little information about the structure of the strings is known, such as in bioinformatics and computer security (Configuration: ngram_delim = ""; and ngram_len = n;). [sent-54, score-1.09]
36 Word n-grams: The strings are described by overlapping word sequences of length n. [sent-57, score-0.559]
37 These features require the definition of delimiters D and a length n. [sent-58, score-0.098]
38 They are often used as a coarse way of capturing structure in text and tokens (Configuration: ngram_delim = D; and ngram_len = n;). [sent-59, score-0.616]
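For illustration, the options above can be collected in a configuration excerpt such as the following. The option names are taken verbatim from the text; the surrounding file layout and comment syntax are assumptions, not Sally's documented format.

    # Hypothetical configuration excerpt for byte 5-grams.
    ngram_len   = 5;      # number of consecutive bytes/words per feature
    ngram_delim = "";     # empty delimiter set selects byte n-grams

    # Word 5-grams would instead combine a delimiter set with the same length:
    # ngram_delim = " .,;:";
    # ngram_len   = 5;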
39 In addition to these basic types, Sally supports extensions for refining the set of string features. [sent-60, score-0.394]
40 For instance, inspired by the weighted-degree kernel (Sonnenburg et al. [sent-61, score-0.023]
41 , 2007), Sally supports the extraction of positional features. [sent-62, score-0.127]
42 Each string feature is extracted along with its position in the string (Configuration: ngram_pos = 1;). [sent-63, score-0.889]
43 As an example, the string abab then contains the positional 2-grams ab₁, ba₂ and ab₃. [sent-64, score-0.4]
44 Moreover, as with any data, strings can suffer from noise. [sent-65, score-0.461]
45 Words may be swapped in text and DNA bases flip positions due to mutations. [sent-66, score-0.078]
46 Such noise can be compensated for by the extension of sorted n-grams (Configuration: ngram_sort = 1;). [sent-67, score-0.243]
47 After the extraction of an n-gram, its elements are sorted, thereby removing the local ordering of the data. [sent-68, score-0.027]
48 This removal may improve performance if the strings suffer from local perturbations. [sent-69, score-0.461]
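Both refinements are easy to state in code. The following Python sketch reproduces the positional 2-gram example from above and the sorting extension; it illustrates the concepts only and is not taken from Sally's implementation.

    # Illustrative sketch of positional and sorted n-grams (not Sally's code).
    def ngrams(s, n):
        # All overlapping n-grams of the string s.
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def positional_ngrams(s, n):
        # Attach positions: "abab" -> [("ab", 1), ("ba", 2), ("ab", 3)].
        return [(g, i + 1) for i, g in enumerate(ngrams(s, n))]

    def sorted_ngrams(s, n):
        # Sort the elements of each n-gram, discarding the local order.
        return ["".join(sorted(g)) for g in ngrams(s, n)]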
49 Finally, Sally also supports the use of stop words (Configuration: stopword_file) and the thresholding of values (Configuration: thres_low and thres_high). [sent-70, score-0.16]
50 When analyzing natural language text, these extensions allow for filtering irrelevant and (in)frequent words from the resulting feature vectors. [sent-71, score-0.048]
51 2.2 Embedding Function. After the extraction of string features, Sally proceeds to construct a feature vector for each string in the defined output format. [sent-73, score-0.721]
52 This construction can be formally defined as a mapping function Φ (see Rieck and Laskov, 2008), Φ : X → ℝ^|S|, Φ : x → (φ_s(x))_{s∈S}, where X corresponds to the domain of strings and S to the set of (hashed) string features. [sent-74, score-0.808]
53 Depending on the configuration, the inner function φ either returns the number of occurrences of the feature s in the string x, a binary flag for the occurrence of s, or a TF-IDF weighting of s. [sent-75, score-0.383]
54 Furthermore, different normalizations can be applied to the resulting vectors (Configuration: vect_embed = embed and vect_norm = norm). [sent-76, score-0.206]
55 The size of the vector space can be controlled using the number of bits for hashing the string features (Configuration: hash_bits), where for k bits the vector space may span up to 2^k dimensions. [sent-77, score-0.611]
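A minimal Python sketch of the complete mapping Φ for byte n-grams might read as follows. The CRC32 hash, the count-based embedding, and the l2 normalization are simplifying assumptions; Sally's actual implementation is more general and more efficient.

    # Sketch of the embedding function Phi (assumed: CRC32 hash, count embedding,
    # l2 normalization).
    import zlib

    def phi(x, n=5, k=24, binary=False):
        vec = {}  # sparse vector: dimension -> value
        for i in range(len(x) - n + 1):
            g = x[i:i + n]                                # byte n-gram
            d = zlib.crc32(g.encode("utf-8")) % (2 ** k)  # hashed dimension
            vec[d] = 1 if binary else vec.get(d, 0) + 1
        norm = sum(v * v for v in vec.values()) ** 0.5    # l2 norm
        return {d: v / norm for d, v in vec.items()} if norm else vec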
56 Fortunately, the number of features extracted by Sally is linear in the length of each string, and the resulting vectors are extremely sparse. [sent-79, score-0.376]
57 This sparsity enables Sally to embed the strings in linear time and space, irrespective of the dimensionality of the vector space (cf. [sent-80, score-0.506]
58 3. Run-time Evaluation. To illustrate the efficient implementation of Sally, we conduct a run-time evaluation with data sets from four application domains: DNA sequences (ARTS; Sonnenburg et al. [sent-84, score-0.046]
59 , 2002), email messages (ENRON; Klimt and Yang, 2004) and text documents (RFC; www. [sent-86, score-0.158]
60 As a baseline, we consider a typical Matlab and Python script for embedding strings. [sent-91, score-0.173]
61 Both scripts are 60–70 lines long and make use of hashing for efficiently mapping strings to vectors. [sent-92, score-0.538]
62 For each data set and implementation, we measure the run-time for embedding strings using byte/word 5-grams as features. [sent-93, score-0.547]
63 We set the size of the feature hashing to 24 bits (2^24 ≈ 16.8 million dimensions). [sent-94, score-0.105]
64 All implementations are able to embed the strings in reasonable time (less than 10 minutes). [sent-99, score-0.486]
65 In comparison with the Python script, Sally embeds the strings more than 2× faster. [sent-101, score-0.439]
66 This performance demonstrates the efficiency of Sally on different types of data, rendering it the tool of choice in applications where time matters, such as in large-scale and on-line learning. [sent-104, score-0.108]
Table 1: Run-time performance of Sally and typical scripts for embedding strings.

                      ARTS             SPROT           ENRON           RFC
Data set size         10^8 DNA bases   10^7 proteins   10^6 words      10^7 words
Number of strings     46,794           150,807         33,702          4,590
String features       byte 5-grams     byte 5-grams    word 5-grams    word 5-grams
Number of features    1,024 (= 4^5)    2,800,728       4,070,853       12,136,059
Matlab script         528 s (9.5×)     …               …               381 s (7.1×)
Sally                 55 s (—)         …               …               55 s (—)
70 The embedding shows a linear run-time complexity even for large strings. [sent-115, score-0.108]
71 On average, Sally is able to map a string to a vector within 0.3 ms, [sent-116, score-0.347]
72 including reading and writing of data, which amounts to an overall throughput of 3,000 strings per second. [sent-117, score-0.48]
73 The run-time per string is measured on the email messages of ENRON and the protein sequences of SPROT. [sent-119, score-0.551]
74 4. Conclusions. Sally provides a generic link between the wealth of string data and the available machinery of learning methods. [sent-121, score-0.347]
75 Instead of targeting a specific application domain, the tool implements a generalized form of the bag-of-words model, which allows for learning with various types of string data, such as natural language text (Rieck and Laskov, 2008), payloads of network packets (Wahl et al. [sent-122, score-0.606]
76 , 2009) or even traces of electricity consumption (Jawurek et al. [sent-123, score-0.085]
77 Moreover, by supporting different input and output formats, the tool can easily interface with common learning environments. [sent-125, score-0.083]
78 Klimt and Yang (2004). The Enron corpus: a new dataset for email classification research. [sent-143, score-0.065]
79 Leslie et al. The spectrum kernel: a string kernel for SVM protein classification. [sent-150, score-0.057]
wordName wordTfidf (topN-words)
[('sally', 0.606), ('strings', 0.439), ('string', 0.347), ('rieck', 0.215), ('ngram', 0.195), ('embedding', 0.108), ('guration', 0.1), ('delim', 0.091), ('enron', 0.091), ('byte', 0.088), ('hash', 0.078), ('len', 0.078), ('laskov', 0.076), ('dna', 0.071), ('delimiters', 0.069), ('jawurek', 0.069), ('salton', 0.069), ('wahl', 0.069), ('wressnegger', 0.069), ('script', 0.065), ('email', 0.065), ('tool', 0.064), ('konrad', 0.059), ('protein', 0.057), ('text', 0.057), ('word', 0.054), ('hashing', 0.053), ('arts', 0.053), ('positional', 0.053), ('sonnenburg', 0.053), ('bits', 0.052), ('embed', 0.047), ('supports', 0.047), ('sequences', 0.046), ('consumption', 0.046), ('donovan', 0.046), ('idalab', 0.046), ('ieck', 0.046), ('ikadorov', 0.046), ('klimt', 0.046), ('ressnegger', 0.046), ('rfc', 0.046), ('sprot', 0.046), ('targeting', 0.046), ('thres', 0.046), ('ttingen', 0.046), ('vect', 0.046), ('whitespace', 0.046), ('formats', 0.045), ('characters', 0.043), ('python', 0.043), ('implements', 0.041), ('ms', 0.041), ('ally', 0.039), ('lodhi', 0.039), ('electricity', 0.039), ('interfacing', 0.039), ('security', 0.037), ('messages', 0.036), ('occurrences', 0.036), ('toolboxes', 0.035), ('weka', 0.032), ('matlab', 0.032), ('shi', 0.03), ('shogun', 0.03), ('berlin', 0.03), ('features', 0.029), ('mapped', 0.029), ('leslie', 0.029), ('sorted', 0.028), ('language', 0.027), ('christian', 0.027), ('extraction', 0.027), ('bioinformatics', 0.026), ('millions', 0.026), ('germany', 0.026), ('types', 0.024), ('scripts', 0.024), ('con', 0.024), ('kernel', 0.023), ('mapping', 0.022), ('environments', 0.022), ('suffer', 0.022), ('bases', 0.021), ('words', 0.021), ('categorization', 0.02), ('enables', 0.02), ('overlapping', 0.02), ('gauging', 0.02), ('normalizations', 0.02), ('compensated', 0.02), ('delimited', 0.02), ('dror', 0.02), ('rendering', 0.02), ('securing', 0.02), ('willems', 0.02), ('supporting', 0.019), ('cristianini', 0.019), ('spectrum', 0.019), ('capable', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
2 0.052555718 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. MLFlex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
3 0.04703616 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
Author: Gil Tahan, Lior Rokach, Yuval Shahar
Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. This study also introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects. We rigorously evaluated Mal-ID and its two extensions with more than ten performance measures, and compared them to the highly rated boosted decision tree method under identical settings. The evaluation demonstrated that Mal-ID and the two Mal-ID extensions outperformed the boosted decision tree method in almost all respects. In addition, the results indicated that by extracting meaningful features, it is sufficient to employ one simple detection rule for classifying executable files. Keywords: computer security, malware detection, common segment analysis, supervised learning
4 0.036247373 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.1 Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision 1. This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. c 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller. H UANG , K AE , D OERSCH AND L EARNED -M ILLER
5 0.02988068 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
6 0.02859666 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
7 0.028520448 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
8 0.026371881 44 jmlr-2012-Feature Selection via Dependence Maximization
9 0.021498531 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
10 0.019704953 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
11 0.019282622 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
12 0.018234698 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
13 0.018116813 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
14 0.017199164 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
15 0.016385484 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
16 0.015870702 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
17 0.015649531 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
18 0.01487446 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
19 0.014740087 12 jmlr-2012-Active Clustering of Biological Sequences
20 0.014717944 4 jmlr-2012-A Kernel Two-Sample Test
topicId topicWeight
[(0, -0.057), (1, 0.009), (2, 0.104), (3, 0.036), (4, 0.033), (5, 0.021), (6, 0.059), (7, -0.022), (8, 0.007), (9, 0.005), (10, -0.011), (11, -0.049), (12, 0.067), (13, -0.084), (14, -0.055), (15, 0.057), (16, 0.045), (17, -0.112), (18, -0.044), (19, -0.108), (20, 0.019), (21, 0.039), (22, -0.048), (23, -0.118), (24, 0.015), (25, -0.179), (26, 0.039), (27, -0.079), (28, 0.061), (29, 0.018), (30, 0.251), (31, -0.254), (32, 0.026), (33, 0.143), (34, 0.027), (35, -0.048), (36, 0.087), (37, -0.244), (38, 0.11), (39, -0.034), (40, -0.137), (41, 0.011), (42, 0.061), (43, -0.045), (44, 0.013), (45, -0.111), (46, -0.079), (47, 0.174), (48, -0.156), (49, -0.151)]
simIndex simValue paperId paperTitle
same-paper 1 0.98491186 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
2 0.59638554 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
3 0.47324145 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
4 0.28309444 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
Author: Tamer Salman, Yoram Baram
Abstract: We describe a quantum algorithm for computing the intersection of two sets and its application to associative memory. The algorithm is based on a modification of Grover’s quantum search algorithm (Grover, 1996). We present algorithms for pattern retrieval, pattern completion, and pattern correction. We show that the quantum associative memory can store an exponential number of memories and retrieve them in sub-exponential time. We prove that this model has advantages over known classical associative memories as well as previously proposed quantum models. Keywords: associative memory, pattern completion, pattern correction, quantum computation, quantum search
5 0.28162289 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
Author: Marius Kloft, Pavel Laskov
Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection
6 0.27144206 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
7 0.23261626 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
8 0.20649278 90 jmlr-2012-Pattern for Python
9 0.16913632 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
10 0.16422486 44 jmlr-2012-Feature Selection via Dependence Maximization
11 0.13808988 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
12 0.1368801 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
13 0.12320639 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
14 0.12274276 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
15 0.12144171 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
16 0.1130879 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
17 0.096951 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
18 0.096914701 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
19 0.094672874 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
20 0.091873251 4 jmlr-2012-A Kernel Two-Sample Test
topicId topicWeight
[(7, 0.016), (21, 0.017), (26, 0.041), (27, 0.016), (29, 0.017), (35, 0.02), (49, 0.02), (56, 0.051), (68, 0.011), (69, 0.016), (75, 0.058), (77, 0.019), (81, 0.015), (89, 0.489), (92, 0.023), (96, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.82019782 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
2 0.20626 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
3 0.20569114 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
4 0.20150855 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
Author: Jun Zhu, Amr Ahmed, Eric P. Xing
Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines
5 0.2008929 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
6 0.19680093 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
7 0.19603059 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
8 0.1960016 44 jmlr-2012-Feature Selection via Dependence Maximization
9 0.19569196 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
10 0.19558658 54 jmlr-2012-Large-scale Linear Support Vector Regression
11 0.19509974 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
12 0.19470568 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
13 0.19453466 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
14 0.19316195 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
15 0.19310351 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
16 0.19310278 106 jmlr-2012-Sign Language Recognition using Sub-Units
17 0.19295111 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
18 0.19113046 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
19 0.19098479 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
20 0.19071984 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems