jmlr jmlr2011 jmlr2011-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas
Abstract: MULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures.

Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation

1. Multi-Label Learning

A multi-label data set consists of training examples that are each associated with a subset of a finite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics, and music categorization into genres and emotions.

There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classification and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label; this vector is then converted to a ranking after resolving ties, or to a bipartition after thresholding (Ioannou et al., 2010).

Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The first group of methods is algorithm independent: they transform the learning task into one or more single-label classification tasks, for which a large body of learning algorithms exists. The second group of methods extends specific learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classifiers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others.

Multi-label learning stretches across several other tasks. When labels are structured as a tree-shaped hierarchy or a directed acyclic graph, we have the interesting task of hierarchical multi-label learning. Dimensionality reduction is another important task for multi-label data, as it is for any kind of data. When bags of instances are used to represent a training object, multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data.

2. The MULAN Library

The main goal of MULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up scientific progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these differ from those used in traditional supervised learning tasks.
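To make the point about evaluation measures concrete, here is a minimal, self-contained sketch of one standard multi-label measure, example-based Hamming loss, computed from boolean bipartition matrices. The class and method names here are hypothetical illustrations and are not part of the MULAN API; MULAN's own evaluation framework computes this and many other measures automatically.

// Standalone illustration of a multi-label evaluation measure; the class and
// method names are hypothetical and independent of the MULAN library.
public class HammingLossExample {

    // Hamming loss: the fraction of label positions where the predicted
    // bipartition disagrees with the ground truth, averaged over all examples.
    static double hammingLoss(boolean[][] truth, boolean[][] predicted) {
        int numExamples = truth.length;
        int numLabels = truth[0].length;
        double disagreements = 0;
        for (int i = 0; i < numExamples; i++) {
            for (int j = 0; j < numLabels; j++) {
                if (truth[i][j] != predicted[i][j]) {
                    disagreements++;
                }
            }
        }
        return disagreements / (numExamples * numLabels);
    }

    public static void main(String[] args) {
        boolean[][] truth     = { { true, false, true }, { false, true, false } };
        boolean[][] predicted = { { true, true,  true }, { false, true, false } };
        // One wrong label position out of six gives a Hamming loss of 1/6.
        System.out.println(hammingLoss(truth, predicted));
    }
}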
Evaluating multi-label algorithms with a variety of measures is considered important by the community, due to the different types of output (bipartitions, rankings) and the diversity of applications. Towards this goal, MULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking, together with an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction, and support for hierarchical multi-label classification, including an implemented algorithm.

MULAN is a library. As such, it offers only a programmatic API to its users; there is no graphical user interface (GUI), and using the library via the command line is also currently not supported. Another drawback of MULAN is that it runs everything in main memory, so there are limitations with very large data sets.

MULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that many machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka, and there are interfaces for most of the core classes.

MULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported in published papers on multi-label learning. To the best of our knowledge, most general learning platforms, like Weka, do not support multi-label data. There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like MULAN.

3. Using MULAN

This section presents an example of how to set up an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java.

The first thing to do is load the multi-label data set that will be used for the empirical evaluation. MULAN requires two text files for the specification of a data set. The first one is in the ARFF format of Weka; the labels should be specified as nominal attributes with values “0” and “1” indicating absence and presence of the label, respectively. The second file is in XML format; it specifies the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML file by nesting the label tag. In our example, the two filenames are given to the experiment class through command-line parameters: String arffFile = Utils.getOption(
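The excerpt breaks off at this getOption call. Below is a minimal sketch of how the complete MulanExp1 class could look, assembled from the code fragments quoted in the sentence list further down (MultiLabelInstances, RAkEL, LabelPowerset, MLkNN, Evaluator, MultipleEvaluation, crossValidate). The import paths, the option names "arff" and "xml" used to complete the truncated call, and the choice of 10 folds are assumptions made so the sketch is self-contained; they are not taken verbatim from the excerpt.

import mulan.classifier.lazy.MLkNN;
import mulan.classifier.meta.RAkEL;
import mulan.classifier.transformation.LabelPowerset;
import mulan.data.MultiLabelInstances;
import mulan.evaluation.Evaluator;
import mulan.evaluation.MultipleEvaluation;
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class MulanExp1 {
    public static void main(String[] args) throws Exception {
        // Filenames are read from command-line options; the option names are assumed here.
        String arffFile = Utils.getOption("arff", args);
        String xmlFile = Utils.getOption("xml", args);

        // Load the data set from the ARFF file and the XML label-specification file.
        MultiLabelInstances data = new MultiLabelInstances(arffFile, xmlFile);

        // RAkEL wraps the transformation-based Label Powerset method, which takes a
        // single-label classifier (here Weka's J48) as a parameter; MLkNN is an
        // algorithm-adaptation method based on k-nearest neighbors.
        RAkEL learner1 = new RAkEL(new LabelPowerset(new J48()));
        MLkNN learner2 = new MLkNN();

        // The Evaluator handles empirical evaluations; a MultipleEvaluation object
        // stores the cross-validation results for all applicable measures.
        Evaluator eval = new Evaluator();
        int numFolds = 10; // assumed value, not stated in the excerpt

        MultipleEvaluation results = eval.crossValidate(learner1, data, numFolds);
        System.out.println(results);

        results = eval.crossValidate(learner2, data, numFolds);
        System.out.println(results);
    }
}

Compiling this class against the MULAN and Weka jars and running it with arguments such as -arff emotions.arff -xml emotions.xml should print the cross-validated values of all applicable evaluation measures for each learner; the exact jar names and flags depend on the distribution and on the option names assumed above.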
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece. Editor: Cheng Soon Ong. Abstract: MULAN is a Java library for learning from multi-label data. [sent-8, score-0.112]
2 It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. [sent-9, score-0.2]
3 In addition, it contains an evaluation framework that calculates a rich variety of performance measures. [sent-10, score-0.061]
4 Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation 1. [sent-11, score-0.101]
5 They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. [sent-14, score-0.098]
6 The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. [sent-17, score-0.215]
7 The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. [sent-18, score-0.181]
8 This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al. [sent-21, score-0.133]
9 When labels are structured as a tree-shaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multi-label learning. [sent-30, score-0.081]
10 Dimensionality reduction is another important task for multi-label data, as it is for any kind of data. [sent-31, score-0.033]
11 The MULAN Library: The main goal of MULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al. [sent-36, score-0.028]
12 The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientific progress. [sent-38, score-0.101]
13 In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks. [sent-39, score-0.093]
14 Evaluating multi-label algorithms with a variety of measures is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. [sent-40, score-0.028]
15 Towards this goal, MULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. [sent-41, score-0.324]
16 In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classification, including an implemented algorithm. [sent-42, score-0.389]
17 As such, it offers only a programmatic API to the library users. [sent-44, score-0.158]
18 Using the library via the command line is also currently not supported. [sent-46, score-0.112]
19 However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. [sent-51, score-0.112]
20 MULAN is an advocate of open science in general. [sent-52, score-0.056]
21 One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported in published papers on multi-label learning. [sent-53, score-0.143]
22 There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like MULAN. [sent-55, score-0.112]
23 Using MULAN: This section presents an example of how to set up an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. [sent-57, score-0.027]
24 We create a new Java class for this experiment, which we call MulanExp1. [sent-58, score-0.029]
25 The labels should be specified as nominal attributes with values “0” and “1” indicating absence and presence of the label, respectively. [sent-63, score-0.093]
26 It specifies the labels and any hierarchical relationships among them. [sent-65, score-0.081]
27 Hierarchies of labels can be expressed in the XML file by nesting the label tag. [sent-66, score-0.121]
28 In our example, the two filenames are given to the experiment class through command-line parameters. [sent-67, score-0.027]
29 MultiLabelInstances data = new MultiLabelInstances(arffFile, xmlFile); The next step is to create an instance from each of the two learners that we want to evaluate. [sent-71, score-0.063]
30 We will create an instance of the RAkEL and MLkNN algorithms. [sent-72, score-0.029]
31 In turn LP is a transformation-based algorithm and it accepts a single-label classifier as a parameter. [sent-74, score-0.031]
32 RAkEL learner1 = new RAkEL(new LabelPowerset(new J48())); MLkNN learner2 = new MLkNN(); We then declare an Evaluator object that handles empirical evaluations and an object of the MultipleEvaluation class that stores cross-validation results. [sent-77, score-0.132]
33 Evaluator eval = new Evaluator(); MultipleEvaluation results; To actually perform the evaluations we use the crossValidate method of the Evaluator class. [sent-78, score-0.027]
34 This returns a MultipleEvaluation object, which we can print to see the results in terms of all applicable evaluation measures available in MULAN. [sent-79, score-0.061]
35 println(results); For running the experiment, we can use the emotions data (emotions. [sent-86, score-0.031]
36 Other open access multi-label data sets can be found at http://mulan. [sent-89, score-0.028]
37 Assuming the experiment’s source file is in the same directory with emotions. [sent-93, score-0.031]
38 jar from the distribution package, then to run this experiment we type the following commands (under Linux use : instead of ; as path separator). [sent-97, score-0.055]
39 Documentation, Requirements and Availability: MULAN’s online documentation contains user-oriented sections, such as getting started with MULAN and the data set format of MULAN, as well as developer-oriented sections, such as extending MULAN, API reference and running tests. [sent-109, score-0.036]
40 There is also a mailing list for requesting support on using or extending MULAN. [sent-110, score-0.059]
41 Acknowledgments We would like to thank several people that have contributed pieces of code to the library. [sent-120, score-0.046]
42 The need for open source software in machine learning. [sent-141, score-0.028]
wordName wordTfidf (topN-words)
[('ulan', 0.666), ('tsoumakas', 0.189), ('ioannis', 0.185), ('weka', 0.182), ('java', 0.159), ('evaluator', 0.148), ('grigorios', 0.148), ('mlknn', 0.148), ('rakel', 0.148), ('library', 0.112), ('auth', 0.111), ('bipartition', 0.111), ('ioannou', 0.111), ('multipleevaluation', 0.111), ('numfolds', 0.111), ('xml', 0.111), ('csd', 0.094), ('api', 0.078), ('ranking', 0.077), ('arff', 0.074), ('arfffile', 0.074), ('args', 0.074), ('bipartitions', 0.074), ('eleftherios', 0.074), ('ilcek', 0.074), ('ioufis', 0.074), ('jozef', 0.074), ('lahavas', 0.074), ('multilabelinstances', 0.074), ('pyromitros', 0.074), ('sakkas', 0.074), ('soumakas', 0.074), ('thessaloniki', 0.074), ('vilcek', 0.074), ('vlahavas', 0.074), ('xmlfile', 0.074), ('witten', 0.063), ('katakis', 0.063), ('mloss', 0.063), ('cheng', 0.059), ('thresholding', 0.056), ('gr', 0.052), ('labels', 0.05), ('people', 0.046), ('offers', 0.046), ('label', 0.043), ('package', 0.042), ('sonnenburg', 0.041), ('availability', 0.041), ('categorization', 0.039), ('string', 0.039), ('le', 0.038), ('ong', 0.037), ('dimensionality', 0.037), ('object', 0.037), ('format', 0.036), ('learners', 0.034), ('burden', 0.033), ('evaluation', 0.033), ('reduction', 0.033), ('powerset', 0.031), ('oded', 0.031), ('eibe', 0.031), ('declare', 0.031), ('genres', 0.031), ('reproduces', 0.031), ('emotions', 0.031), ('accepts', 0.031), ('alphabetical', 0.031), ('crossvalidate', 0.031), ('directory', 0.031), ('greg', 0.031), ('gui', 0.031), ('mailing', 0.031), ('nowadays', 0.031), ('plethora', 0.031), ('hierarchical', 0.031), ('create', 0.029), ('lior', 0.028), ('commands', 0.028), ('abel', 0.028), ('nesting', 0.028), ('yann', 0.028), ('advocate', 0.028), ('gpl', 0.028), ('genomics', 0.028), ('leon', 0.028), ('platforms', 0.028), ('print', 0.028), ('requesting', 0.028), ('lp', 0.028), ('soon', 0.028), ('classi', 0.028), ('open', 0.028), ('variety', 0.028), ('experiment', 0.027), ('evaluations', 0.027), ('query', 0.027), ('concerned', 0.027), ('implementing', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
2 0.10519243 50 jmlr-2011-LPmade: Link Prediction Made Easy
Author: Ryan N. Lichtenwalter, Nitesh V. Chawla
Abstract: LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction models with a version of WEKA modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of steps to generate plots, gigabytes or terabytes of output, and actionable or publishable results. Keywords: link prediction, network analysis, multicore, GNU make, PropFlow, HPLP
3 0.044309702 102 jmlr-2011-Waffles: A Machine Learning Toolkit
Author: Michael Gashler
Abstract: We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License. Keywords: machine learning, toolkits, data mining, C++, open source
4 0.037248481 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
Author: Jan Saputra Müller, Paul von Bünau, Frank C. Meinecke, Franz J. Király, Klaus-Robert Müller
Abstract: The Stationary Subspace Analysis (SSA) algorithm linearly factorizes a high-dimensional time series into stationary and non-stationary components. The SSA Toolbox is a platform-independent efficient stand-alone implementation of the SSA algorithm with a graphical user interface written in Java, that can also be invoked from the command line and from Matlab. The graphical interface guides the user through the whole process; data can be imported and exported from comma separated values (CSV) and Matlab’s .mat files. Keywords: non-stationarities, blind source separation, dimensionality reduction, unsupervised learning
5 0.032865107 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
Author: Fabien Lauer, Yann Guermeur
Abstract: This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine. All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides for them the first unified implementation and offers a convenient basis to develop other instances. This is also the first parallel implementation for M-SVMs. The package consists in a set of command-line tools with a callable library. The documentation includes a tutorial, a user’s guide and a developer’s guide. Keywords: multi-class support vector machines, open source, C
6 0.030595396 58 jmlr-2011-Learning from Partial Labels
7 0.028863307 83 jmlr-2011-Scikit-learn: Machine Learning in Python
8 0.027759826 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
9 0.025961153 55 jmlr-2011-Learning Multi-modal Similarity
10 0.02411432 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
11 0.021519121 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
12 0.0211269 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.019181853 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
14 0.018327277 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
15 0.018099375 93 jmlr-2011-The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets
16 0.018013587 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
17 0.016997989 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
18 0.016461225 105 jmlr-2011-lp-Norm Multiple Kernel Learning
19 0.016287023 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
20 0.014851195 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
topicId topicWeight
[(0, 0.076), (1, -0.039), (2, 0.012), (3, -0.056), (4, -0.034), (5, 0.003), (6, -0.004), (7, -0.01), (8, -0.011), (9, -0.026), (10, -0.116), (11, -0.031), (12, 0.218), (13, -0.0), (14, -0.329), (15, -0.082), (16, -0.047), (17, 0.055), (18, 0.16), (19, 0.167), (20, -0.041), (21, -0.148), (22, -0.052), (23, 0.012), (24, -0.055), (25, 0.047), (26, 0.134), (27, -0.152), (28, 0.011), (29, 0.108), (30, -0.032), (31, 0.027), (32, -0.007), (33, -0.088), (34, 0.124), (35, -0.165), (36, 0.102), (37, -0.008), (38, 0.082), (39, 0.132), (40, -0.02), (41, -0.063), (42, -0.052), (43, -0.014), (44, -0.002), (45, -0.158), (46, -0.022), (47, 0.146), (48, 0.044), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.96510136 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
2 0.881881 50 jmlr-2011-LPmade: Link Prediction Made Easy
3 0.5423016 102 jmlr-2011-Waffles: A Machine Learning Toolkit
4 0.21621363 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
5 0.19343251 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
Author: Şeyda Ertekin, Cynthia Rudin
Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
6 0.19092722 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
7 0.1701913 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
8 0.15584727 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
9 0.14237177 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
10 0.12213152 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
11 0.11977116 61 jmlr-2011-Logistic Stick-Breaking Process
12 0.11377694 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
13 0.11335326 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.10774518 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
15 0.10735296 48 jmlr-2011-Kernel Analysis of Deep Networks
16 0.10343481 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
17 0.10193329 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets
18 0.10068598 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
19 0.094430812 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
20 0.094386131 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
topicId topicWeight
[(4, 0.014), (9, 0.019), (10, 0.012), (24, 0.034), (31, 0.036), (32, 0.685), (36, 0.011), (60, 0.018), (73, 0.022), (78, 0.03), (90, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.92024934 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
2 0.74940813 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
3 0.73851949 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (fCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation
4 0.67336339 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
5 0.28092939 102 jmlr-2011-Waffles: A Machine Learning Toolkit
6 0.24549815 48 jmlr-2011-Kernel Analysis of Deep Networks
7 0.22492726 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
8 0.22201024 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
9 0.2078131 50 jmlr-2011-LPmade: Link Prediction Made Easy
10 0.19687329 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.19605215 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
12 0.19467518 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data
13 0.18998757 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.18811852 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
15 0.1872568 91 jmlr-2011-The Sample Complexity of Dictionary Learning
16 0.18418325 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
17 0.183925 61 jmlr-2011-Logistic Stick-Breaking Process
18 0.18333034 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
19 0.18281364 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
20 0.18265402 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization