jmlr jmlr2012 jmlr2012-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Konrad Rieck, Christian Wressnegger, Alexander Bikadorov
Abstract: Strings and sequences are ubiquitous in many areas of data analysis. However, only a few learning methods can be directly applied to this form of data. We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. Sally has been successfully applied for learning with natural language text, DNA sequences and monitored program behavior. Keywords: string embedding, bag-of-words models, learning with sequential data
Reference: text
sentIndex sentText sentNum sentScore
1 University of Göttingen, Goldschmidtstraße 7, 37077 Göttingen, Germany. Christian Wressnegger, WRESSNEGGER@IDALAB.DE. [sent-3, score-0.046]
2 idalab GmbH, Adalbertstraße 20, 10977 Berlin, Germany. Alexander Bikadorov, ABIKU@CS.TU-BERLIN.DE. [sent-4, score-0.046]
3 Technische Universität Berlin, Franklinstraße 28/29, 10587 Berlin, Germany. Editor: Antti Honkela. Abstract: Strings and sequences are ubiquitous in many areas of data analysis. [sent-6, score-0.046]
4 We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. [sent-8, score-0.958]
5 Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. [sent-9, score-0.877]
6 The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. [sent-10, score-0.832]
7 The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. [sent-11, score-0.213]
8 Sally has been successfully applied for learning with natural language text, DNA sequences and monitored program behavior. [sent-12, score-0.073]
9 Keywords: string embedding, bag-of-words models, learning with sequential data. [sent-13, score-0.347]
10 1. Introduction. Strings and sequences are a common representation of data, and many applications of machine learning center on analyzing strings, for example, discovering topics in natural language text, identifying genes in DNA, or filtering spam messages. [sent-14, score-0.073]
11 However, the vast majority of learning methods cannot be directly applied to string data, as these methods usually operate in vector spaces and require numerical vectors as input for learning. [sent-15, score-0.347]
12 While a large body of research has studied techniques for learning with strings—most notably work on mapping strings to vectors (see Salton et al. [sent-16, score-0.461]
13 , 1975; Damashek, 1995) and string kernels (see Sonnenburg et al. [sent-17, score-0.347]
14 In this article, we present Sally, a general-purpose tool for mapping a set of strings to a set of vectors. [sent-19, score-0.525]
15 This mapping is referred to as an embedding of strings and allows for applying a wide range of learning methods to string data. [sent-20, score-0.916]
16 Sally implements a generalized form of the bag-of-words model, where strings are mapped to a high-dimensional vector space that is spanned by a set of string features. [sent-21, score-0.856]
17 Different types of features are supported for this embedding, which range from words [sent-22, score-0.074]
18 delimited by whitespace characters to positional and sorted n-grams. [sent-23, score-0.19]
19 Sally proceeds by counting the occurrences of these features in each string and generating a sparse vector of frequency values. [sent-24, score-0.412]
20 The implementation of Sally builds on string algorithms with linear run-time and space complexity, which enables processing millions of strings and features. [sent-25, score-0.832]
21 Sally is not the only tool for learning with strings; some learning toolboxes also provide support for extracting features from strings or computing string kernels. [sent-26, score-0.914]
22 The respective implementations are often tightly coupled with the toolboxes and restricted to specific applications. [sent-27, score-0.035]
23 By contrast, Sally can be thought of as a “Swiss Army Knife” for embedding strings: Instead of targeting a single toolbox or application, Sally provides a generic link between string data and learning methods. [sent-28, score-0.501]
24 The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. [sent-29, score-0.213]
25 Sally has been successfully applied in diverse learning settings, including text categorization, intrusion detection, clustering of malicious software, and analysis of electricity consumption (cf. [sent-30, score-0.142]
26 2. Embedding of Strings. Sally implements a generalized form of the classic bag-of-words model (Salton et al. [sent-36, score-0.041]
27 A string is represented by a set of string features (“the bag”) and mapped to a vector space whose dimensions are associated with the occurrences of these features. [sent-38, score-0.788]
28 This association is created using a hash function, where the hash value of each feature defines its dimension. [sent-39, score-0.156]
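To make this association concrete, the following Python sketch mimics the hashing step. It is an illustration of the idea only, not Sally's actual code, and the choice of CRC32 as the hash function is an assumption.

    # Illustrative sketch of hashing-based dimension assignment (not Sally's code).
    # The hash value of a feature, taken modulo 2**k, defines its dimension.
    import zlib

    def feature_dimension(feature, k=24):
        # CRC32 stands in for the unspecified hash function used by the tool.
        return zlib.crc32(feature.encode("utf-8")) % (2 ** k)

    # All occurrences of the same feature land in the same dimension.
    assert feature_dimension("ngram") == feature_dimension("ngram")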
29 Moreover, the tool extends the original bag-of-words model by supporting features derived from string kernels, such as the spectrum kernel (Leslie et al. [sent-40, score-0.501]
30 2.1 String Features. Sally supports three basic types of string features for constructing a bag-of-words model. [sent-45, score-0.447]
31 These types are defined implicitly by specifying delimiters (Configuration: ngram_delim) and the number of consecutive bytes/words to consider (Configuration: ngram_len). [sent-46, score-0.483]
32 Words: The strings are partitioned into substrings using a set of delimiter characters D. [sent-49, score-0.482]
33 Such partitioning is typical for natural language processing, where the delimiters are usually defined as whitespace and punctuation characters (Configuration: ngram_delim = D; and ngram_len = 1;). [sent-50, score-0.744]
34 Byte n-grams: The strings are characterized by overlapping byte sequences of length n. [sent-53, score-0.593]
35 These features are frequently used when little information about the structure of the strings is known, such as in bioinformatics and computer security (Configuration: ngram_delim = ""; and ngram_len = n;). [sent-54, score-1.09]
36 Word n-grams: The strings are described by overlapping word sequences of length n. [sent-57, score-0.559]
37 These features require the definition of delimiters D and a length n. [sent-58, score-0.098]
38 They are often used as a coarse way of capturing structure in text and tokens (Configuration: ngram_delim = D; and ngram_len = n;). [sent-59, score-0.616]
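For illustration, the options above can be collected in a configuration excerpt such as the following. The option names are taken verbatim from the text; the surrounding file layout and comment syntax are assumptions, not Sally's documented format.

    # Hypothetical configuration excerpt for byte 5-grams.
    ngram_len   = 5;      # number of consecutive bytes/words per feature
    ngram_delim = "";     # empty delimiter set selects byte n-grams

    # Word 5-grams would instead combine a delimiter set with the same length:
    # ngram_delim = " .,;:";
    # ngram_len   = 5;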
39 In addition to these basic types, Sally supports extensions for refining the set of string features. [sent-60, score-0.394]
40 For instance, inspired by the weighted-degree kernel (Sonnenburg et al. [sent-61, score-0.023]
41 , 2007), Sally supports the extraction of positional features. [sent-62, score-0.127]
42 Each string feature is extracted along with its position in the string (Configuration: ngram_pos = 1;). [sent-63, score-0.889]
43 As an example, the string abab then contains the positional 2-grams ab₁, ba₂ and ab₃. [sent-64, score-0.4]
44 Moreover, as with any data, strings can suffer from noise. [sent-65, score-0.461]
45 Words may be swapped in text and DNA bases flip positions due to mutations. [sent-66, score-0.078]
46 Such noise can be compensated for by the extension of sorted n-grams (Configuration: ngram_sort = 1;). [sent-67, score-0.243]
47 After the extraction of an n-gram, its elements are sorted, thereby removing the local ordering of the data. [sent-68, score-0.027]
48 This removal may improve performance if the strings suffer from local perturbations. [sent-69, score-0.461]
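Both refinements are easy to state in code. The following Python sketch reproduces the positional 2-gram example from above and the sorting extension; it illustrates the concepts only and is not taken from Sally's implementation.

    # Illustrative sketch of positional and sorted n-grams (not Sally's code).
    def ngrams(s, n):
        # All overlapping n-grams of the string s.
        return [s[i:i + n] for i in range(len(s) - n + 1)]

    def positional_ngrams(s, n):
        # Attach positions: "abab" -> [("ab", 1), ("ba", 2), ("ab", 3)].
        return [(g, i + 1) for i, g in enumerate(ngrams(s, n))]

    def sorted_ngrams(s, n):
        # Sort the elements of each n-gram, discarding the local order.
        return ["".join(sorted(g)) for g in ngrams(s, n)]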
49 Finally, Sally also supports the use of stop words (Configuration: stopword_file) and the thresholding of values (Configuration: thres_low and thres_high). [sent-70, score-0.16]
50 When analyzing natural language text, these extensions allow for filtering irrelevant and (in)frequent words from the resulting feature vectors. [sent-71, score-0.048]
51 2.2 Embedding Function. After the extraction of string features, Sally proceeds to construct a feature vector for each string in the defined output format. [sent-73, score-0.721]
52 This construction can be formally defined as a mapping function Φ (see Rieck and Laskov, 2008), Φ : X → ℝ^|S|, Φ : x → (φ_s(x))_{s∈S}, where X corresponds to the domain of strings and S to the set of (hashed) string features. [sent-74, score-0.808]
53 Depending on the configuration, the inner function φ either returns the number of occurrences of the feature s in the string x, a binary flag for the occurrence of s, or a TF-IDF weighting of s. [sent-75, score-0.383]
54 Furthermore, different normalizations can be applied to the resulting vectors (Configuration: vect_embed = embed and vect_norm = norm). [sent-76, score-0.206]
55 The size of the vector space can be controlled using the number of bits for hashing the string features (Configuration: hash_bits), where for k bits the vector space may span up to 2^k dimensions. [sent-77, score-0.611]
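A minimal Python sketch of the complete mapping Φ for byte n-grams might read as follows. The CRC32 hash, the count-based embedding, and the l2 normalization are simplifying assumptions; Sally's actual implementation is more general and more efficient.

    # Sketch of the embedding function Phi (assumed: CRC32 hash, count embedding,
    # l2 normalization).
    import zlib

    def phi(x, n=5, k=24, binary=False):
        vec = {}  # sparse vector: dimension -> value
        for i in range(len(x) - n + 1):
            g = x[i:i + n]                                # byte n-gram
            d = zlib.crc32(g.encode("utf-8")) % (2 ** k)  # hashed dimension
            vec[d] = 1 if binary else vec.get(d, 0) + 1
        norm = sum(v * v for v in vec.values()) ** 0.5    # l2 norm
        return {d: v / norm for d, v in vec.items()} if norm else vec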
56 Fortunately, the number of features extracted by Sally is linear in the length of each string, and the resulting vectors are extremely sparse. [sent-79, score-0.376]
57 This sparsity enables Sally to embed the strings in linear time and space, irrespective of the dimensionality of the vector space (cf. [sent-80, score-0.506]
58 3. Run-time Evaluation. To illustrate the efficient implementation of Sally, we conduct a run-time evaluation with data sets from four application domains: DNA sequences (ARTS; Sonnenburg et al. [sent-84, score-0.046]
59 , 2002), email messages (ENRON; Klimt and Yang, 2004) and text documents (RFC; www. [sent-86, score-0.158]
60 As a baseline, we consider a typical Matlab and Python script for embedding strings. [sent-91, score-0.173]
61 Both scripts are 60–70 lines long and make use of hashing for efficiently mapping strings to vectors. [sent-92, score-0.538]
62 For each data set and implementation, we measure the run-time for embedding strings using byte/word 5-grams as features. [sent-93, score-0.547]
63 We set the size of the feature hashing to 24 bits (2^24 ≈ 16.8 million dimensions). [sent-94, score-0.105]
64 All implementations are able to embed the strings in reasonable time (less than 10 minutes). [sent-99, score-0.486]
65 In comparison with the Python script, Sally embeds the strings more than 2× faster. [sent-101, score-0.439]
66 This performance demonstrates the efficiency of Sally on different types of data, rendering it the tool of choice in applications where time matters, such as in large-scale and on-line learning. [sent-104, score-0.108]
Table 1: Run-time performance of Sally and typical scripts for embedding strings.

                      ARTS             SPROT           ENRON           RFC
Data set size         10^8 DNA bases   10^7 proteins   10^6 words      10^7 words
Number of strings     46,794           150,807         33,702          4,590
String features       byte 5-grams     byte 5-grams    word 5-grams    word 5-grams
Number of features    1,024 (= 4^5)    2,800,728       4,070,853       12,136,059
Matlab script         528 s (9.5×)     …               …               381 s (7.1×)
Sally                 55 s (—)         …               …               55 s (—)
70 The embedding shows a linear run-time complexity even for large strings. [sent-115, score-0.108]
71 On average, Sally is able to map a string to a vector within 0.3 ms, [sent-116, score-0.347]
72 including reading and writing of data, which amounts to an overall throughput of 3,000 strings per second. [sent-117, score-0.48]
73 The run-time per string is measured on the email messages of ENRON and the protein sequences of SPROT. [sent-119, score-0.551]
74 4. Conclusions. Sally provides a generic link between the wealth of string data and the available machinery of learning methods. [sent-121, score-0.347]
75 Instead of targeting a specific application domain, the tool implements a generalized form of the bag-of-words model, which allows for learning with various types of string data, such as natural language text (Rieck and Laskov, 2008), payloads of network packets (Wahl et al. [sent-122, score-0.606]
76 , 2009) or even traces of electricity consumption (Jawurek et al. [sent-123, score-0.085]
77 Moreover, by supporting different input and output formats, the tool can easily interface with common learning environments. [sent-125, score-0.083]
78 Klimt and Yang (2004). The Enron corpus: a new dataset for email classification research. [sent-143, score-0.065]
79 Leslie et al. The spectrum kernel: a string kernel for SVM protein classification. [sent-150, score-0.057]
wordName wordTfidf (topN-words)
[('sally', 0.606), ('strings', 0.439), ('string', 0.347), ('rieck', 0.215), ('ngram', 0.195), ('embedding', 0.108), ('guration', 0.1), ('delim', 0.091), ('enron', 0.091), ('byte', 0.088), ('hash', 0.078), ('len', 0.078), ('laskov', 0.076), ('dna', 0.071), ('delimiters', 0.069), ('jawurek', 0.069), ('salton', 0.069), ('wahl', 0.069), ('wressnegger', 0.069), ('script', 0.065), ('email', 0.065), ('tool', 0.064), ('konrad', 0.059), ('protein', 0.057), ('text', 0.057), ('word', 0.054), ('hashing', 0.053), ('arts', 0.053), ('positional', 0.053), ('sonnenburg', 0.053), ('bits', 0.052), ('embed', 0.047), ('supports', 0.047), ('sequences', 0.046), ('consumption', 0.046), ('donovan', 0.046), ('idalab', 0.046), ('ieck', 0.046), ('ikadorov', 0.046), ('klimt', 0.046), ('ressnegger', 0.046), ('rfc', 0.046), ('sprot', 0.046), ('targeting', 0.046), ('thres', 0.046), ('ttingen', 0.046), ('vect', 0.046), ('whitespace', 0.046), ('formats', 0.045), ('characters', 0.043), ('python', 0.043), ('implements', 0.041), ('ms', 0.041), ('ally', 0.039), ('lodhi', 0.039), ('electricity', 0.039), ('interfacing', 0.039), ('security', 0.037), ('messages', 0.036), ('occurrences', 0.036), ('toolboxes', 0.035), ('weka', 0.032), ('matlab', 0.032), ('shi', 0.03), ('shogun', 0.03), ('berlin', 0.03), ('features', 0.029), ('mapped', 0.029), ('leslie', 0.029), ('sorted', 0.028), ('language', 0.027), ('christian', 0.027), ('extraction', 0.027), ('bioinformatics', 0.026), ('millions', 0.026), ('germany', 0.026), ('types', 0.024), ('scripts', 0.024), ('con', 0.024), ('kernel', 0.023), ('mapping', 0.022), ('environments', 0.022), ('suffer', 0.022), ('bases', 0.021), ('words', 0.021), ('categorization', 0.02), ('enables', 0.02), ('overlapping', 0.02), ('gauging', 0.02), ('normalizations', 0.02), ('compensated', 0.02), ('delimited', 0.02), ('dror', 0.02), ('rendering', 0.02), ('securing', 0.02), ('willems', 0.02), ('supporting', 0.019), ('cristianini', 0.019), ('spectrum', 0.019), ('capable', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
2 0.052555718 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. MLFlex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
3 0.04703616 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
Author: Gil Tahan, Lior Rokach, Yuval Shahar
Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. This study also introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects. We rigorously evaluated Mal-ID and its two extensions with more than ten performance measures, and compared them to the highly rated boosted decision tree method under identical settings. The evaluation demonstrated that Mal-ID and the two Mal-ID extensions outperformed the boosted decision tree method in almost all respects. In addition, the results indicated that by extracting meaningful features, it is sufficient to employ one simple detection rule for classifying executable files. Keywords: computer security, malware detection, common segment analysis, supervised learning
4 0.036247373 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.1 Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision 1. This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. c 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller. H UANG , K AE , D OERSCH AND L EARNED -M ILLER
5 0.02988068 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
6 0.02859666 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
7 0.028520448 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
8 0.026371881 44 jmlr-2012-Feature Selection via Dependence Maximization
9 0.021498531 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
10 0.019704953 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
11 0.019282622 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
12 0.018234698 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
13 0.018116813 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
14 0.017199164 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
15 0.016385484 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
16 0.015870702 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
17 0.015649531 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
18 0.01487446 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
19 0.014740087 12 jmlr-2012-Active Clustering of Biological Sequences
20 0.014717944 4 jmlr-2012-A Kernel Two-Sample Test
topicId topicWeight
[(0, -0.057), (1, 0.009), (2, 0.104), (3, 0.036), (4, 0.033), (5, 0.021), (6, 0.059), (7, -0.022), (8, 0.007), (9, 0.005), (10, -0.011), (11, -0.049), (12, 0.067), (13, -0.084), (14, -0.055), (15, 0.057), (16, 0.045), (17, -0.112), (18, -0.044), (19, -0.108), (20, 0.019), (21, 0.039), (22, -0.048), (23, -0.118), (24, 0.015), (25, -0.179), (26, 0.039), (27, -0.079), (28, 0.061), (29, 0.018), (30, 0.251), (31, -0.254), (32, 0.026), (33, 0.143), (34, 0.027), (35, -0.048), (36, 0.087), (37, -0.244), (38, 0.11), (39, -0.034), (40, -0.137), (41, 0.011), (42, 0.061), (43, -0.045), (44, 0.013), (45, -0.111), (46, -0.079), (47, 0.174), (48, -0.156), (49, -0.151)]
simIndex simValue paperId paperTitle
same-paper 1 0.98491186 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
2 0.59638554 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
3 0.47324145 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
4 0.28309444 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
Author: Tamer Salman, Yoram Baram
Abstract: We describe a quantum algorithm for computing the intersection of two sets and its application to associative memory. The algorithm is based on a modification of Grover’s quantum search algorithm (Grover, 1996). We present algorithms for pattern retrieval, pattern completion, and pattern correction. We show that the quantum associative memory can store an exponential number of memories and retrieve them in sub-exponential time. We prove that this model has advantages over known classical associative memories as well as previously proposed quantum models. Keywords: associative memory, pattern completion, pattern correction, quantum computation, quantum search
5 0.28162289 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
Author: Marius Kloft, Pavel Laskov
Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection
6 0.27144206 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
7 0.23261626 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
8 0.20649278 90 jmlr-2012-Pattern for Python
9 0.16913632 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
10 0.16422486 44 jmlr-2012-Feature Selection via Dependence Maximization
11 0.13808988 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
12 0.1368801 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
13 0.12320639 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
14 0.12274276 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
15 0.12144171 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
16 0.1130879 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
17 0.096951 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
18 0.096914701 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
19 0.094672874 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
20 0.091873251 4 jmlr-2012-A Kernel Two-Sample Test
topicId topicWeight
[(7, 0.016), (21, 0.017), (26, 0.041), (27, 0.016), (29, 0.017), (35, 0.02), (49, 0.02), (56, 0.051), (68, 0.011), (69, 0.016), (75, 0.058), (77, 0.019), (81, 0.015), (89, 0.489), (92, 0.023), (96, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.82019782 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
2 0.20626 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
Author: Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir
Abstract: Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general ℓ p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on six datasets from various application areas, and compare closely the performance of ℓ p MK-FDA, fixed norm MK-FDA, and several variants of SVM-based MKL (MK-SVM). Our results demonstrate that ℓ p MK-FDA improves upon sparse MK-FDA in many practical situations. The results also show that on image categorisation problems, ℓ p MK-FDA tends to outperform its SVM counterpart. Finally, we also discuss the connection between (MK-)FDA and (MK-)SVM, under the unified framework of regularised kernel machines. Keywords: multiple kernel learning, kernel fisher discriminant analysis, regularised least squares, support vector machines
3 0.20569114 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
4 0.20150855 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
Author: Jun Zhu, Amr Ahmed, Eric P. Xing
Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines
5 0.2008929 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
6 0.19680093 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
7 0.19603059 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
8 0.1960016 44 jmlr-2012-Feature Selection via Dependence Maximization
9 0.19569196 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
10 0.19558658 54 jmlr-2012-Large-scale Linear Support Vector Regression
11 0.19509974 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
12 0.19470568 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
13 0.19453466 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
14 0.19316195 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
15 0.19310351 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
16 0.19310278 106 jmlr-2012-Sign Language Recognition using Sub-Units
17 0.19295111 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
18 0.19113046 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
19 0.19098479 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
20 0.19071984 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems