jmlr jmlr2011 jmlr2011-50 knowledge-graph by maker-knowledge-mining
Title: LPmade: Link Prediction Made Easy
Source: pdf
Author: Ryan N. Lichtenwalter, Nitesh V. Chawla
Abstract: LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction models with a version of WEKA modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of steps to generate plots, gigabytes or terabytes of output, and actionable or publishable results. Keywords: link prediction, network analysis, multicore, GNU make, PropFlow, HPLP
Reference: text
Department of Computer Science, University of Notre Dame, Notre Dame, IN 46556, USA
Editor: Geoff Holmes
1. Introduction

Link prediction is succinctly stated as the problem of identifying yet-unobserved links in a network. This task is of increasing interest in both research and corporate contexts. Virtually every major conference and journal in data mining or machine learning now has a significant network science component, and these often include treatments of link prediction. Further, even for standard prediction algorithms, researchers must often write new code or cobble together existing code fragments. The workflow required to achieve predictions and fair evaluation is time-consuming, challenging, and error-prone.
LPmade is the first library to focus specifically on link prediction, incorporating general and extensible forms of the predictors introduced by Liben-Nowell and Kleinberg (2007). It also streamlines and parameterizes the complex link prediction workflow so that researchers can start with source data and achieve predictions in minimal time.
Many other network analysis packages exist: some offer extreme generality, some extreme efficiency, some modeling utilities, and some a dizzying array of algorithms. LPmade's software components are, by necessity, designed for high performance, and it offers a wide array of graph analysis algorithms, but it is first and foremost an extensive toolkit for performing link prediction to achieve both research and application goals. Unlike other options, LPmade provides an organized collection of link prediction algorithms in a build framework that is accessible to researchers across many disciplines.
2. The Software Package

The purpose of LPmade is to provide a workbench on which others may conduct link prediction research and applications. For link prediction tasks in many large networks, even a restricted set of predictions may involve millions, billions, or even trillions of lines of output. Each unsupervised link prediction method, the supervised classification framework from Lichtenwalter et al. (2010), and all of the evaluation tools are optimized for just such quantities of data. Nonetheless, the entire process of starting from raw source data and ending with predictions, evaluations, and plots involves an extensive series of steps that may each take a long time.
The software includes a carefully constructed dependency tracking system that minimizes overhead and simplifies the management of correct procedures. Both the build system and the link prediction library are modular and extensible. Researchers can incorporate their own prediction methods into the library and the automation framework just by writing a C++ class and changing a make variable.
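As a concrete illustration of that extension mechanism, a new unsupervised predictor might be written roughly as follows. This is a hypothetical sketch: the Network and LinkPredictor types and the generateScore method are stand-ins invented here to show the general shape of such a class, not the actual LPmade interfaces.

// Hypothetical sketch of adding a predictor to a library like LPmade's.
// Type and method names are illustrative stand-ins, not the real API.
#include <cstdio>
#include <cstddef>
#include <vector>

// Adjacency-list graph over vertices numbered 0..n-1.
struct Network {
    std::vector< std::vector<std::size_t> > neighbors;
    std::size_t degree(std::size_t v) const { return neighbors[v].size(); }
};

// Base class each predictor would derive from.
class LinkPredictor {
public:
    explicit LinkPredictor(const Network& net) : network(net) {}
    virtual ~LinkPredictor() {}
    // Score the potential link (s, t); higher means more likely.
    virtual double generateScore(std::size_t s, std::size_t t) = 0;
protected:
    const Network& network;
};

// New predictor: preferential attachment, the product of the degrees.
class DegreeProductPredictor : public LinkPredictor {
public:
    explicit DegreeProductPredictor(const Network& net) : LinkPredictor(net) {}
    double generateScore(std::size_t s, std::size_t t) {
        return static_cast<double>(network.degree(s)) * network.degree(t);
    }
};

int main() {
    Network net;
    net.neighbors.resize(3);
    net.neighbors[0].push_back(1); net.neighbors[1].push_back(0);
    net.neighbors[1].push_back(2); net.neighbors[2].push_back(1);
    DegreeProductPredictor predictor(net);
    std::printf("score(0, 2) = %f\n", predictor.generateScore(0, 2));
    return 0;
}

Registering such a class with the automation would then amount to the make-variable change the text describes, for example appending the predictor's name to a list in the common Makefile.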
The library includes clearly written yet optimized versions of the most common asymptotically optimal network analysis algorithms for sampling, finding connected components, computing centrality measures, and calculating useful statistics. LPmade specializes in link prediction by including commonly used unsupervised link prediction methods: Adamic/Adar, common neighbors, Jaccard's coefficient, Katz, preferential attachment, PropFlow, rooted PageRank, SimRank, and weighted rooted PageRank.
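For reference, the two simplest of these scores, common neighbors and Adamic/Adar, are defined for a candidate pair (x, y) in terms of the neighbor set Γ(v). These are the standard textbook definitions, not excerpts from the LPmade source:

\mathrm{CN}(x, y) = |\Gamma(x) \cap \Gamma(y)|, \qquad \mathrm{AA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}

Each unsupervised method assigns such a score to every candidate pair, and ranking the pairs by score yields the predictions.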
The library also has some simpler methods useful in producing feature vectors for supervised learners: clustering coefficient, geodesic distance, degree, PageRank, volume or gregariousness, mutuality, path count, and shortest path count. These methods may be selectively incorporated as features into the supervised framework by Lichtenwalter et al. (2010).
Several graph libraries such as the Boost Graph Library are brilliantly designed for maximum generality and flexibility with template parameters and complex inheritance models. One minor drawback of such libraries is that the code is complex to read and modify. The code base for this library takes a narrower approach by offering fewer mechanisms for generality, but as a result it has a much shallower learning curve.
2.2 GNU make Script and Supporting Tools

Although it can be used and extended as such, LPmade is not just a library of C++ code for network analysis and link prediction. It is additionally an extensive set of scripts designed for sophisticated automation and dependency resolution. These scripts are all incorporated into a set of two co-dependent Makefiles: one task-specific and one common.
Each step involves multiple invocations of many programs to properly assemble data and perform fair evaluation. The automation requires of the user only the task-specific Makefile, which generally involves fewer than 20 lines of user code. This Makefile is where users specify the manner in which raw source data is converted to the initial data stream required by subsequent steps in the pipeline. It is also where rules from the common Makefile can be overridden for task-specific reasons.
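The following is a minimal sketch of what such a task-specific Makefile might look like. It is hypothetical: the variable names, target names, and the name of the included common Makefile are illustrative assumptions, not the identifiers LPmade actually defines.

# Hypothetical task-specific Makefile sketch; all names are assumptions.
# Recipe lines must begin with a tab character.
NETWORK := collaboration

# Convert raw source records into the gzip-compressed initial edge
# stream (source vertex, destination vertex, timestamp) consumed by
# the rest of the pipeline.
$(NETWORK).stream.gz: raw/records.csv
	awk -F, '{ print $$1, $$2, $$3 }' $< | sort -k3,3n | gzip > $@

# Pull in the general rules shared by every task.
include ../common.mk

Because logical tasks receive their own rules, invoking make with the -j flag (for example, make -j8) lets every rule whose prerequisites are satisfied run as a separate process on its own core, which is the multi-core behavior described below.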
The common Makefile includes all the general rules that apply to any network analysis or link prediction task once the task-specific Makefile is written to enable proper handling of raw input. It is designed with advanced template features that allow make to modify the original Makefile rules in accordance with user requirements. Logical tasks are aggressively given their own rules so that the multi-core features of GNU make are of optimal benefit.
In general, users need not be familiar with writing Makefiles. The important options governing the behavior of the automatic build system are presented at the top of the common Makefile along with documentation. Each rule in the Makefile script with no outstanding prerequisites is handled by a separate process to make use of additional cores.
For many large networks, link prediction and its supporting analysis yield very large output files. When this prolific output is further combined into data sets, both the I/O capacity and bandwidth requirements may become problematic. To combat this, most steps in the workflow create, accept, and output gzip-compressed results. Especially on multi-core systems, this yields a hefty decrease in I/O capacity and bandwidth requirements with a minimal impact on performance. In most cases, the output from gunzip is produced faster than the consuming process can accept it. Where necessary, named pipes are used to ameliorate potentially large temporary storage requirements.
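The same pattern is straightforward to follow in downstream consumers: rather than decompressing to a temporary file, a tool can stream records through gunzip. The following is a generic sketch of that idiom on a POSIX system; the record layout and file name are invented for illustration, and this is not LPmade source code.

// Generic sketch: stream gzip-compressed scores through gunzip via a
// pipe (POSIX popen) instead of writing a decompressed temporary file.
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s scores.gz\n", argv[0]);
        return 1;
    }
    char command[4096];
    std::snprintf(command, sizeof(command), "gunzip -c %s", argv[1]);
    FILE* pipe = popen(command, "r");  // read gunzip's stdout directly
    if (pipe == NULL) {
        return 1;
    }
    // Assumed record layout: source vertex, destination vertex, score.
    unsigned long src, dst, records = 0;
    double score, sum = 0.0;
    while (std::fscanf(pipe, "%lu %lu %lf", &src, &dst, &score) == 3) {
        sum += score;
        ++records;
    }
    pclose(pipe);
    std::printf("%lu records, mean score %g\n", records,
                records ? sum / records : 0.0);
    return 0;
}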
2.3 WEKA Modifications

LPmade includes a modified version of WEKA 3. LPmade does not rely on WEKA's graphical interfaces; instead, the build system uses WEKA classifier implementations to construct supervised models for link prediction. Unmodified, WEKA has several limitations that make even its command-line mode problematic for operation on enormous link prediction testing sets.
These include processing overhead for unwanted computations, Java string overflow and potential thrashing from in-memory result concatenation, and an inability to handle compressed C4.5 input. Alternatives such as MOA solve some but not all of these problems, and WEKA internal classes such as AbstractOutput are unavailable at the command line. We have chosen to modify the WEKA command-line evaluation path to compute only the necessary information and to output directly to standard output for LPmade scripted downstream processing. We also modified WEKA to accept gzip-compressed C4.5 input and use this support in the build system to take advantage of significant space savings on disk.
The network library includes an easily extended architecture for testing and verifying individual binaries. The C++ library is written in platform-independent C++ code using only the STL, and it may thus be built on any architecture and any operating system that provides a C++ compiler. An included set of high-speed evaluation tools is written in C99 and builds on any system with such a compiler.
The bundled distribution of WEKA is cross-platform but requires a suitable version of the Java virtual machine. The common Makefile additionally employs many standard tools such as cut, paste, sed, awk, perl, sort, and gzip, as well as a bundled distribution of gnuplot 4.
Acknowledgments

Research was sponsored in part by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 and in part by the National Science Foundation Grant BCS-0826958.
simIndex simValue paperId paperTitle

same-paper 1 1.0000001 50 jmlr-2011-50 LPmade: Link Prediction Made Easy
2 0.10519243 63 jmlr-2011-63 MULAN: A Java Library for Multi-Label Learning
3 0.064670347 102 jmlr-2011-102 Waffles: A Machine Learning Toolkit
4 0.034959294 75 jmlr-2011-75 Parallel Algorithm for Learning Optimal Bayesian Network Structure
5 0.034936007 83 jmlr-2011-83 Scikit-learn: Machine Learning in Python
6 0.034860071 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
7 0.034314707 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
8 0.033456415 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
9 0.032560546 48 jmlr-2011-Kernel Analysis of Deep Networks
10 0.030710522 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
11 0.028868554 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
12 0.027411535 93 jmlr-2011-The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets
13 0.025296099 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
14 0.025211688 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.024798321 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.024721127 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
17 0.024164902 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
18 0.023461005 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
19 0.021467503 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
20 0.020547599 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
simIndex simValue paperId paperTitle

same-paper 1 0.98369157 50 jmlr-2011-50 LPmade: Link Prediction Made Easy
2 0.88843775 63 jmlr-2011-63 MULAN: A Java Library for Multi-Label Learning
3 0.57522291 102 jmlr-2011-102 Waffles: A Machine Learning Toolkit
4 0.20069388 48 jmlr-2011-48 Kernel Analysis of Deep Networks
5 0.19117528 75 jmlr-2011-75 Parallel Algorithm for Learning Optimal Bayesian Network Structure
6 0.18807951 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
7 0.17786708 58 jmlr-2011-Learning from Partial Labels
8 0.16660024 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
9 0.16460587 93 jmlr-2011-The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets
10 0.16011049 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
11 0.14940557 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
12 0.13935208 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
13 0.13622884 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
14 0.1359672 92 jmlr-2011-The Stationary Subspace Analysis Toolbox
15 0.13315511 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
16 0.125093 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
17 0.12494544 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
18 0.12168486 61 jmlr-2011-Logistic Stick-Breaking Process
19 0.12085337 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
20 0.11222591 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
simIndex simValue paperId paperTitle

same-paper 1 0.87651193 50 jmlr-2011-50 LPmade: Link Prediction Made Easy
2 0.28800881 102 jmlr-2011-102 Waffles: A Machine Learning Toolkit
3 0.20975403 62 jmlr-2011-62 MSVMpack: A Multi-Class Support Vector Machine Package
4 0.19701606 15 jmlr-2011-15 CARP: Software for Fishing Out Good Clustering Algorithms
5 0.1884084 68 jmlr-2011-68 Natural Language Processing (Almost) from Scratch
6 0.17743918 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
7 0.17727032 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
8 0.17599013 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
9 0.16998486 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
10 0.16720949 48 jmlr-2011-Kernel Analysis of Deep Networks
11 0.16575845 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.16535804 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
13 0.16525257 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
14 0.16354744 16 jmlr-2011-Clustering Algorithms for Chains
15 0.16236497 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
16 0.16186339 12 jmlr-2011-Bayesian Co-Training
17 0.16108765 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
18 0.16028313 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
19 0.15960784 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
20 0.15942821 95 jmlr-2011-Training SVMs Without Offset