jmlr jmlr2011 jmlr2011-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael Gashler
Abstract: We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License. Keywords: machine learning, toolkits, data mining, C++, open source
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer Science, Brigham Young University, Provo, UT 84602, USA. Editor: Soeren Sonnenburg. Abstract: We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. [sent-4, score-0.16]
2 The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. [sent-5, score-0.323]
3 All functionality is also available in a C++ class library. [sent-6, score-0.099]
4 Introduction. Although several open source machine learning toolkits already exist (Sonnenburg et al. [sent-9, score-0.11]
5 , 2007), many of them implicitly impose requirements regarding how they can be used. [sent-10, score-0.022]
6 For example, some toolkits require a certain platform, language, or virtual machine. [sent-11, score-0.11]
7 Others are designed such that tools can only be connected together with a specific plug-in, filter, or signal/slot architecture. [sent-12, score-0.162]
8 Unfortunately, these interface differences create difficulty for those who have become familiar with a different methodology, and for those who seek to use tools from multiple toolkits together. [sent-13, score-0.268]
9 Toolkits that use a graphical interface may be convenient for performing common experiments, but become cumbersome when the user wishes to use a tool in a manner that was not foreseen by the interface designer, or to automate common and repetitive tasks. [sent-14, score-0.447]
10 Waffles is a collection of tools that seek to provide a wide diversity of useful operations in machine learning and related fields without imposing unnecessary process or interface restrictions on the user. [sent-15, score-0.346]
11 This is done by providing simple command-line interface (CLI) tools that perform basic tasks. [sent-16, score-0.268]
12 Since these tools perform operations at a fairly granular level, they can be used in ways not foreseen by the interface designer. [sent-18, score-0.333]
13 As an example, consider an experiment involving the following seven steps: 1. [sent-19, score-0.022]
14 Use cross-validation to evaluate the accuracy of a bagging ensemble of one hundred decision trees for classifying the lymph data set (available at http://MLData. [sent-20, score-0.092]
15 Convert input-features to real-valued vectors by representing each nominal attribute as a categorical distribution over possible values. [sent-25, score-0.033]
16 Use principal component analysis to reduce the dimensionality of the feature-vectors. [sent-27, score-0.049]
17 These seven operations can be performed with Waffles tools using CLI commands of the sort sketched below: [sent-35, score-0.186]
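(A hedged sketch, not quoted from the paper: the executable names waffles_learn, waffles_transform, and waffles_dimred, and the tool names crossvalidate, bag, decisiontree, dropcolumns, and pca, echo terms that appear elsewhere in this text, but the exact argument syntax below is an assumption.)

  # Step 1 (sketch): cross-validate a bagging ensemble of 100 decision
  # trees on the lymph data set; the argument order is assumed.
  waffles_learn crossvalidate lymph.arff bag 100 decisiontree end

  # Later steps (sketch): drop a column, then reduce dimensionality with
  # PCA; the column index and target dimensionality are placeholders.
  waffles_transform dropcolumns lymph.arff 0 > features.arff
  waffles_dimred pca features.arff 8 > reduced.arff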
18 It is certainly conceivable that a graphical interface could be developed that would make it easy to perform an experiment like this one. [sent-64, score-0.168]
19 Such an interface might even provide some mechanism to automatically perform the same experiment over an array of data sets, using an array of different models. [sent-65, score-0.204]
20 If, however, the user needs to vary a parameter specific to the experiment, such as the number of principal components, or a model-specific parameter, such as the number of trees in the ensemble, the benefits of a graphical interface are quickly overcome by additional complexity. [sent-66, score-0.23]
21 By contrast, a simple script that calls CLI commands to perform machine learning operations can be directly modified to vary any of the parameters. [sent-67, score-0.055]
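As a minimal illustration, a shell script along these lines could sweep a model-specific parameter such as the ensemble size; only the loop itself is ordinary shell, while the waffles_learn invocation reuses the assumed syntax from the sketch above.

  # Sketch: vary the number of trees in the bagging ensemble.
  for trees in 10 50 100 500; do
      echo "== $trees trees =="
      waffles_learn crossvalidate lymph.arff bag $trees decisiontree end
  done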
22 Additionally, the scripting method can incorporate tools from other toolkits, or even custom-developed tools. [sent-68, score-0.142]
23 Because nearly all programming languages can target CLI applications, there are few barriers to adding custom operations. [sent-69, score-0.055]
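As a sketch of that interoperability, a hypothetical custom tool (mytool below stands in for any CLI program, written in any language) can be interleaved with Waffles commands through intermediate files; the Waffles invocations again use the assumed syntax from above.

  # Sketch: mix a custom CLI tool into a Waffles pipeline.
  waffles_transform dropcolumns data.arff 0 > step1.arff
  mytool step1.arff > step2.arff
  waffles_learn crossvalidate step2.arff decisiontree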
24 Figure 1: The model-space visualization generated by the command in step 7. [sent-71, score-0.074]
25 Figure 2: A partial screen shot of the Waffles Wizard tool displayed in a web browser. [sent-72, score-0.115]
26 Wizard. One significant reason many people prefer tools unified within a graphical interface over scriptable CLI tools is that it can be cumbersome to remember which options are available with CLI tools, and how to construct a syntactically correct command. [sent-74, score-0.525]
27 We solve this problem by providing a “Wizard” tool that guides the user through a series of forms to construct a command that will perform the desired task. [sent-75, score-0.126]
28 A screen shot of this tool (displayed in a web browser) is shown in Figure 2. [sent-76, score-0.095]
29 Rather than execute the selected operation directly, as most GUI tools do, the Waffles Wizard tool merely displays the CLI command that will perform the operation. [sent-77, score-0.229]
30 The user may paste it directly into a command shell to perform the operation immediately, or incorporate it into a script. [sent-78, score-0.167]
31 This gives the user the benefits of a GUI, without the undesirable tendency to lock the user into an interface that is inflexible for scripted automation. [sent-79, score-0.241]
32 Capabilities. In order to highlight the capabilities of Waffles, we compare its functionality with that found in Weka (Hall et al. [sent-81, score-0.137]
33 , 2009), which at the time of this writing is the most popular machine learning toolkit by a significant margin. [sent-82, score-0.024]
34 Our intent is not to persuade the reader to choose Waffles instead of Weka, but rather to show that many useful capabilities can be gained by using Waffles in conjunction with Weka, and other toolkits that offer a CLI. [sent-83, score-0.209]
35 One notable strength of Waffles is in unsupervised algorithms, particularly dimensionality reduction techniques. [sent-84, score-0.049]
36 Waffles tools implement principal component analysis (PCA), isomap (Tenenbaum et al. [sent-85, score-0.16]
37 , 2000), locally linear embedding (Roweis and Saul, 2000), manifold sculpting (Gashler et al. [sent-86, score-0.037]
38 , 2011b), unsupervised backpropagation and temporal nonlinear dimensionality reduction (Gashler et al. [sent-88, score-0.066]
39 Waffles contains clustering techniques including k-means, k-medoids, and agglomerative clustering, as well as related transduction algorithms, including agglomerative transduction and max-flow/min-cut transduction (Blum and Chawla, 2001). [sent-91, score-0.148]
40 Waffles provides some of the most common supervised learning techniques, such as decision trees, multi-layer neural networks, k-nearest neighbor, and naive Bayes, and some less common algorithms, such as Mean-margin trees (Gashler et al. [sent-92, score-0.023]
41 Waffles’ collection of supervised algorithms is much smaller than that of Weka, which implements more than 50 classification algorithms. [sent-94, score-0.018]
42 Waffles, however, provides an interface that offers several advantages in many situations. [sent-95, score-0.126]
43 For example, Weka requires the user to set up filters that convert data to types that each algorithm can handle. [sent-96, score-0.058]
44 Waffles automatically handles type conversion when an algorithm receives a type that it is not implicitly designed to handle, while still permitting advanced users to specify custom filters. [sent-97, score-0.092]
45 As some algorithm-specific examples, the Waffles implementation of the multi-layer perceptron provides the ability to use a diversity of activation functions, and also supplies methods for training recurrent networks. [sent-99, score-0.056]
46 The k-nearest neighbor algorithm automatically supports acceleration structures and sparse training data, so it is suitable for use with problems that require high scalability, such as document classification. [sent-100, score-0.054]
47 As was demonstrated in the first example in this paper, Waffles features a particularly convenient mechanism for creating bagging ensembles. [sent-101, score-0.052]
48 It also provides a diversity of collaborative filtering algorithms and optimization techniques that are not found in Weka. [sent-102, score-0.059]
49 Waffles also provides tools for linear-algebraic operations and various data-mining tasks, including attribute selection and several methods for visualization. [sent-103, score-0.175]
50 Architecture. The Waffles tools are organized into several executable applications. [sent-105, score-0.16]
51 , recommend, tools related to collaborative filtering, and sparse, tools for learning with sparse matrices. [sent-116, score-0.305]
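Assuming the executables follow a waffles_* naming convention, the organization suggested by the fragment above might look like the following; recommend and sparse are named in the text, while the other application names are inferred from tool names mentioned elsewhere in the paper and should be checked against the documentation.

  waffles_learn      # supervised learning: train, crossvalidate, bag, ...
  waffles_transform  # data transformations such as dropcolumns
  waffles_dimred     # dimensionality reduction: pca, isomap, ...
  waffles_plot       # visualization and plotting tools
  waffles_recommend  # collaborative filtering
  waffles_sparse     # learning with sparse matrices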
52 Each tool in each of these applications is implemented as a thin wrapper around functionality in a C++ class library, called GClasses. [sent-117, score-0.133]
53 This library is included with Waffles so that any of the functionality available in the Waffles CLI tools can also be linked into C++ applications, or into applications developed in other languages that are capable of linking with C++ libraries. [sent-118, score-0.285]
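A minimal sketch of what building against the library might look like from the command line; the include path, source file, and -lGClasses linker flag are illustrative assumptions rather than the project's documented build instructions.

  # Sketch: compile a C++ program that calls into the GClasses library.
  g++ -I/path/to/waffles/src myapp.cpp -o myapp -lGClasses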
54 The entire Waffles project is licensed under the GNU Lesser General Public License (LGPL) version 2. [sent-119, score-0.018]
55 Waffles uses a minimal set of dependency libraries, and is carefully designed to support cross-platform compatibility. [sent-125, score-0.02]
56 A new version of Waffles has been released approximately every six months since it was first made public in 2005. [sent-127, score-0.112]
57 Full documentation for the CLI tools, including many examples, as well as documentation for developers seeking to link with the GClasses library, can be found at that site. [sent-131, score-0.112]
58 In order to augment the developer documentation, several demo applications are also included with Waffles, showing how to build machine learning tools that link with functionality in the GClasses library. [sent-132, score-0.259]
wordName wordTfidf (topN-words)
[('waf', 0.822), ('waffles', 0.238), ('cli', 0.216), ('gashler', 0.195), ('tools', 0.142), ('interface', 0.126), ('toolkits', 0.11), ('wizard', 0.108), ('functionality', 0.099), ('weka', 0.091), ('decisiontree', 0.065), ('ventura', 0.065), ('command', 0.053), ('transduction', 0.05), ('ashler', 0.043), ('dimred', 0.043), ('dropcolumns', 0.043), ('foreseen', 0.043), ('gclasses', 0.043), ('lgpl', 0.043), ('oolkit', 0.043), ('user', 0.039), ('diversity', 0.038), ('public', 0.038), ('capabilities', 0.038), ('scripted', 0.037), ('lesser', 0.037), ('crossvalidate', 0.037), ('gui', 0.037), ('cumbersome', 0.037), ('ijcnn', 0.037), ('sonnenburg', 0.036), ('ensemble', 0.036), ('documentation', 0.036), ('tool', 0.034), ('bagging', 0.033), ('commands', 0.033), ('shot', 0.033), ('custom', 0.033), ('attribute', 0.033), ('bag', 0.033), ('dimensionality', 0.031), ('gnu', 0.03), ('tenenbaum', 0.03), ('doi', 0.03), ('transform', 0.029), ('holmes', 0.028), ('achine', 0.028), ('released', 0.028), ('screen', 0.028), ('pca', 0.027), ('remember', 0.027), ('mike', 0.027), ('blum', 0.027), ('es', 0.027), ('offer', 0.025), ('toolkit', 0.024), ('agglomerative', 0.024), ('graphical', 0.024), ('trees', 0.023), ('operations', 0.022), ('seven', 0.022), ('languages', 0.022), ('implicitly', 0.022), ('library', 0.022), ('visualization', 0.021), ('collaborative', 0.021), ('array', 0.021), ('cybernetics', 0.02), ('displayed', 0.02), ('designed', 0.02), ('neighbor', 0.019), ('convert', 0.019), ('july', 0.019), ('mechanism', 0.019), ('manifold', 0.019), ('principal', 0.018), ('conceivable', 0.018), ('supplies', 0.018), ('executable', 0.018), ('pfahringer', 0.018), ('reutemann', 0.018), ('months', 0.018), ('acceleration', 0.018), ('paste', 0.018), ('shell', 0.018), ('intent', 0.018), ('repetitive', 0.018), ('persuade', 0.018), ('licensed', 0.018), ('demo', 0.018), ('developers', 0.018), ('sculpting', 0.018), ('platform', 0.018), ('collection', 0.018), ('unsupervised', 0.018), ('ltering', 0.017), ('nonlinear', 0.017), ('automatically', 0.017), ('roweis', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 102 jmlr-2011-Waffles: A Machine Learning Toolkit
2 0.064670347 50 jmlr-2011-LPmade: Link Prediction Made Easy
Author: Ryan N. Lichtenwalter, Nitesh V. Chawla
Abstract: LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction models with a version of WEKA modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of steps to generate plots, gigabytes or terabytes of output, and actionable or publishable results. Keywords: link prediction, network analysis, multicore, GNU make, PropFlow, HPLP
3 0.044309702 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas
Abstract: MULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures. Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation 1. Multi-Label Learning A multi-label data set consists of training examples that are associated with a subset of a finite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classification and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label. This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al., 2010). Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The first group of methods are algorithm independent. They transform the learning task into one or more single-label classification tasks, for which a large body of learning algorithms exists. The second group of methods extend specific learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classifiers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others. Multi-label learning stretches across several other tasks. When labels are structured as a tree-shaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multi-label learning. Dimensionality reduction is another important task for multi-label data, as it is for any kind of data. When bags of instances are used to represent a training object, then multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data. 2. The MULAN Library The main goal of MULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientific progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks.
Evaluating multi-label algorithms with a variety of measures is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. Towards this goal, MULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classification, including an implemented algorithm. MULAN is a library. As such, it offers only a programmatic API to the library users. There is no graphical user interface (GUI) available. The possibility to use the library via the command line is also currently not supported. Another drawback of MULAN is that it runs everything in main memory, so there exist limitations with very large data sets. MULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that several machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. MULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported in published papers on multi-label learning. To the best of our knowledge, most of the general learning platforms, like Weka, don’t support multi-label data. There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like MULAN. 3. Using MULAN This section presents an example of how to set up an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java. The first thing to do is load the multi-label data set that will be used for the empirical evaluation. MULAN requires two text files for the specification of a data set. The first one is in the ARFF format of Weka. The labels should be specified as nominal attributes with values “0” and “1” indicating absence and presence of the label respectively. The second file is in XML format. It specifies the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML file by nesting the label tag. In our example, the two filenames are given to the experiment class through command-line parameters. String arffFile = Utils.getOption(
4 0.03571023 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
Author: Jan Saputra Müller, Paul von Bünau, Frank C. Meinecke, Franz J. Király, Klaus-Robert Müller
Abstract: The Stationary Subspace Analysis (SSA) algorithm linearly factorizes a high-dimensional time series into stationary and non-stationary components. The SSA Toolbox is a platform-independent, efficient, stand-alone implementation of the SSA algorithm with a graphical user interface written in Java that can also be invoked from the command line and from Matlab. The graphical interface guides the user through the whole process; data can be imported and exported from comma separated values (CSV) and Matlab’s .mat files. Keywords: non-stationarities, blind source separation, dimensionality reduction, unsupervised learning
5 0.031157767 83 jmlr-2011-Scikit-learn: Machine Learning in Python
Author: Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net. Keywords: Python, supervised learning, unsupervised learning, model selection
6 0.024883278 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
7 0.022714034 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
8 0.019158266 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
9 0.018100997 55 jmlr-2011-Learning Multi-modal Similarity
10 0.017551202 93 jmlr-2011-The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets
11 0.015173342 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
12 0.013056908 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.012713291 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
14 0.012524067 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
15 0.012467603 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
16 0.011102566 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
17 0.011011547 105 jmlr-2011-lp-Norm Multiple Kernel Learning
18 0.010998239 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
19 0.010534445 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
20 0.01005088 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
topicId topicWeight
[(0, 0.05), (1, -0.027), (2, 0.006), (3, -0.038), (4, -0.037), (5, 0.004), (6, 0.007), (7, -0.013), (8, -0.016), (9, -0.015), (10, -0.075), (11, -0.007), (12, 0.165), (13, 0.011), (14, -0.293), (15, -0.091), (16, 0.052), (17, 0.073), (18, 0.127), (19, 0.099), (20, 0.015), (21, -0.12), (22, 0.064), (23, -0.05), (24, 0.021), (25, 0.06), (26, 0.056), (27, -0.051), (28, 0.044), (29, 0.039), (30, 0.058), (31, -0.042), (32, -0.062), (33, 0.013), (34, -0.009), (35, -0.125), (36, 0.088), (37, 0.085), (38, 0.093), (39, 0.03), (40, -0.074), (41, -0.055), (42, 0.043), (43, -0.072), (44, -0.085), (45, 0.17), (46, 0.188), (47, -0.097), (48, 0.088), (49, -0.324)]
simIndex simValue paperId paperTitle
same-paper 1 0.98161656 102 jmlr-2011-Waffles: A Machine Learning Toolkit
2 0.495718 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
3 0.49561355 50 jmlr-2011-LPmade: Link Prediction Made Easy
4 0.34220776 83 jmlr-2011-Scikit-learn: Machine Learning in Python
5 0.23009542 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
Author: Fabien Lauer, Yann Guermeur
Abstract: This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine. All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides for them the first unified implementation and offers a convenient basis to develop other instances. This is also the first parallel implementation for M-SVMs. The package consists of a set of command-line tools with a callable library. The documentation includes a tutorial, a user’s guide and a developer’s guide. Keywords: multi-class support vector machines, open source, C
6 0.22541197 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms
7 0.17393117 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
8 0.15990362 12 jmlr-2011-Bayesian Co-Training
9 0.14081667 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
10 0.1248944 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
11 0.11777882 55 jmlr-2011-Learning Multi-modal Similarity
12 0.11662994 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
13 0.11174442 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
14 0.10750522 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
15 0.10397841 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
16 0.099412635 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
17 0.096847974 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
18 0.084676728 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
19 0.081310913 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
20 0.079498887 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
topicId topicWeight
[(4, 0.021), (9, 0.023), (10, 0.017), (13, 0.012), (14, 0.431), (24, 0.017), (31, 0.061), (32, 0.066), (36, 0.045), (41, 0.023), (60, 0.038), (66, 0.018), (71, 0.015), (73, 0.03), (78, 0.033), (90, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.74906659 102 jmlr-2011-Waffles: A Machine Learning Toolkit
2 0.24275525 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (fCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation
3 0.24148656 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
4 0.22794636 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may not be crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas
Abstract: M ULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures. Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation 1. Multi-Label Learning A multi-label data set consists of training examples that are associated with a subset of a finite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classification and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label. This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al., 2010). Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The first group of methods are algorithm independent. They transform the learning task into one or more singlelabel classification tasks, for which a large body of learning algorithms exists. The second group of methods extend specific learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classifiers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others. Multi-label learning stretches across several other tasks. When labels are structured as a treeshaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multilabel learning. Dimensionality reduction is another important task for multi-label data, as it is for c 2011 Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek and Ioannis Vlahavas. T SOUMAKAS , S PYROMITROS -X IOUFIS , V ILCEK AND V LAHAVAS any kind of data. When bags of instances are used to represent a training object, then multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data. 2. The M ULAN Library The main goal of M ULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientific progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks. 
Evaluating multi-label algorithms with a variety of measures, is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. Towards this goal, M ULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classification, including an implemented algorithm. M ULAN is a library. As such, it offers only programmatic API to the library users. There is no graphical user interface (GUI) available. The possibility to use the library via command line, is also currently not supported. Another drawback of M ULAN is that it runs everything in main memory so there exist limitations with very large data sets. M ULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that several machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. M ULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported on published papers on multi-label learning. To the best of our knowledge, most of the general learning platforms, like Weka, don’t support multi-label data. There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like M ULAN. 3. Using M ULAN This section presents an example of how to setup an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java. The first thing to do is load the multi-label data set that will be used for the empirical evaluation. M ULAN requires two text files for the specification of a data set. The first one is in the ARFF format of Weka. The labels should be specified as nominal attributes with values “0” and “1” indicating 2412 M ULAN : A JAVA L IBRARY FOR M ULTI -L ABEL L EARNING absence and presence of the label respectively. The second file is in XML format. It specifies the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML file by nesting the label tag. In our example, the two filenames are given to the experiment class through command-line parameters. String arffFile = Utils.getOption(
6 0.22173686 50 jmlr-2011-LPmade: Link Prediction Made Easy
7 0.21289079 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
8 0.21016556 48 jmlr-2011-Kernel Analysis of Deep Networks
9 0.20462751 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
10 0.20112398 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
11 0.19698532 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
12 0.19689457 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
13 0.19596168 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
14 0.19440238 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
15 0.19332093 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
16 0.19254491 91 jmlr-2011-The Sample Complexity of Dictionary Learning
17 0.1922449 12 jmlr-2011-Bayesian Co-Training
18 0.1905237 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
19 0.18968485 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
20 0.1891029 61 jmlr-2011-Logistic Stick-Breaking Process