jmlr jmlr2011 jmlr2011-63 jmlr2011-63-reference knowledge-graph by maker-knowledge-mining

63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning

Source: pdf

Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas

Abstract: M ULAN is a Java library for learning from multi-label data. It offers a variety of classiﬁcation, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures. Keywords: multi-label data, classiﬁcation, ranking, thresholding, dimensionality reduction, hierarchical classiﬁcation, evaluation 1. Multi-Label Learning A multi-label data set consists of training examples that are associated with a subset of a ﬁnite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classiﬁcation and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label. This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al., 2010). Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The ﬁrst group of methods are algorithm independent. They transform the learning task into one or more singlelabel classiﬁcation tasks, for which a large body of learning algorithms exists. The second group of methods extend speciﬁc learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classiﬁers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others. Multi-label learning stretches across several other tasks. When labels are structured as a treeshaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multilabel learning. Dimensionality reduction is another important task for multi-label data, as it is for c 2011 Grigorios Tsoumakas, Eleftherios Spyromitros-Xiouﬁs, Jozef Vilcek and Ioannis Vlahavas. T SOUMAKAS , S PYROMITROS -X IOUFIS , V ILCEK AND V LAHAVAS any kind of data. When bags of instances are used to represent a training object, then multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data. 2. The M ULAN Library The main goal of M ULAN is to bring the beneﬁts of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientiﬁc progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks. Evaluating multi-label algorithms with a variety of measures, is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. Towards this goal, M ULAN offers a plethora of state-of-the-art algorithms for multi-label classiﬁcation and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classiﬁcation, including an implemented algorithm. M ULAN is a library. As such, it offers only programmatic API to the library users. There is no graphical user interface (GUI) available. The possibility to use the library via command line, is also currently not supported. Another drawback of M ULAN is that it runs everything in main memory so there exist limitations with very large data sets. M ULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that several machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. M ULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported on published papers on multi-label learning. To the best of our knowledge, most of the general learning platforms, like Weka, don’t support multi-label data. There are currently only a number of implementations of speciﬁc multi-label learning algorithms, but not a general library like M ULAN. 3. Using M ULAN This section presents an example of how to setup an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java. The ﬁrst thing to do is load the multi-label data set that will be used for the empirical evaluation. M ULAN requires two text ﬁles for the speciﬁcation of a data set. The ﬁrst one is in the ARFF format of Weka. The labels should be speciﬁed as nominal attributes with values “0” and “1” indicating 2412 M ULAN : A JAVA L IBRARY FOR M ULTI -L ABEL L EARNING absence and presence of the label respectively. The second ﬁle is in XML format. It speciﬁes the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML ﬁle by nesting the label tag. In our example, the two ﬁlenames are given to the experiment class through command-line parameters. String arffFile = Utils.getOption(

reference text

Marios Ioannou, George Sakkas, Grigorios Tsoumakas, and Ioannis Vlahavas. Obtaining bipartitions from score vectors for multi-label classiﬁcation. IEEE International Conference on Tools with Artiﬁcial Intelligence, 1:409–416, 2010. ISSN 1082-3409. S¨ ren Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey o Holmes, Yann LeCun, Klaus-Robert M¨ ller, Fernando Pereira, Carl Edward Rasmussen, Gunu nar R¨ tsch, Bernhard Sch¨ lkopf, Alexander Smola, Pascal Vincent, Jason Weston, and Robert a o Williamson. The need for open source software in machine learning. JMLR, 8:2443–2466, 2007. Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, chapter 34, pages 667–685. Springer, 2nd edition, 2010. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005. 1. Available at http://mulan.sourceforge.net/documentation.html. 2. Available for download at http://sourceforge.net/projects/mulan/. 2414