WEKA—Experiences with a Java Open-Source Project
Authors: Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten
Abstract: WEKA is a popular machine learning workbench with a development life of nearly two decades. This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software's functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project.

Keywords: machine learning software, open source software
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Editor: Soeren Sonnenburg

1. Introduction

We present a brief account of the WEKA 3 software, which is distributed under the GNU General Public License, followed by some lessons learned over the period spanning its development and maintenance.
We also include a brief historical mention of its predecessors. WEKA contains implementations of algorithms for classification, clustering, and association rule mining, along with graphical user interfaces and visualization utilities for data exploration and algorithm evaluation. This article shares some background on software design and management decisions, in the hope that it may prove useful to others involved in the development of open-source machine learning software. Hall et al. (2009) give an overview of the system; more comprehensive sources of information are Witten and Frank's book Data Mining (2005) and the user manuals included in the software distribution. The wekalist mailing list is a forum for discussion of WEKA-related queries, with nearly 3000 subscribers.
WEKA is a machine learning workbench that supports many activities of machine learning practitioners. Furthermore, WEKA includes meta-classifiers like bagging, boosting, and stacking; multiple instance classifiers; and interfaces for classifiers implemented in Groovy and Jython.
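As a hedged sketch of how this looks through the Java API (not code from the paper; the file name and learner choices are illustrative), a meta-classifier simply wraps a base learner:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.meta.Bagging;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class MetaClassifierExample {
      public static void main(String[] args) throws Exception {
        // Load a data set in ARFF format; assume the last attribute is the class.
        Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Wrap the J48 decision tree learner in the Bagging meta-classifier.
        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());
        bagger.buildClassifier(data);
        System.out.println(bagger);
      }
    }

Because Bagging itself extends Classifier, the wrapped ensemble can be used anywhere a single classifier can.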
Data can be inspected visually by plotting attribute values against the class, or against other attribute values. For specific methods there are specialized tools for visualization, such as a tree viewer for any method that produces classification trees, a Bayes network viewer with automatic layout, and a dendrogram viewer for hierarchical clustering. WEKA also includes support for association rule mining, comparing classifiers, data set generation, facilities for annotated documentation generation for source code, distribution estimation, and data conversion.
Graphical User Interfaces

WEKA's functionality can be accessed through various graphical user interfaces, principally the Explorer and Experimenter interfaces shown in Figure 1, but also the Knowledge Flow interface. The most popular interface, the Explorer, allows quick exploration of data and supports all the main items mentioned above—data loading and filtering, classification, clustering, attribute selection and various forms of visualization—in an interactive fashion. The Knowledge Flow interface is a Java Beans application that allows the same kind of data exploration, processing and visualization as the Explorer (along with some extras), but in a workflow-oriented system. The user can define a workflow specifying how data is loaded, preprocessed, evaluated and visualized, which can be repeated multiple times. WEKA also includes some specialized graphical interfaces, such as a Bayes network editor that focuses on Bayes network learning and inference, an SQL viewer for interaction with databases, and an ARFF data file viewer and editor. All functionality, and some more specialized functions, can be accessed from a command-line interface, so WEKA can be used without a windowing system.
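For instance (a sketch; the data file names are placeholders), a classifier can be trained and evaluated from the shell by invoking its class directly:

    java weka.classifiers.trees.J48 -t train.arff -T test.arff

Here -t names the training file and -T an optional test file; if -T is omitted, WEKA reports a cross-validation estimate on the training data instead.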
Extending WEKA

One of WEKA's major strengths is that it is easily extended with customized or new classifiers, clusterers, attribute selection methods, and other components. The code fragment in Figure 2 shows a minimal implementation of a classifier that returns the mean or mode of the class in the training set (double values are used to store indices of nominal attribute values). Any new class is picked up by the graphical user interfaces through Java introspection: no further coding is needed to deploy it from WEKA's graphical user interfaces. This makes it easy to evaluate how new algorithms perform compared to any of the existing ones, which explains WEKA's popularity among machine learning researchers. Besides being easy to extend, WEKA includes a wide range of support for adding functionality to basic implementations. For instance, a classifier can have various optional settings by implementing a pre-defined interface for option handling.
Each option can be documented using a tool-tip text method, which is picked up by the help dialogs of the graphical user interfaces.

    package weka.classifiers;

    import weka.core.*;

    public class NewClassifier extends Classifier {

      double m_fMean;

      public void buildClassifier(Instances data) throws Exception {
        // Mean (numeric class) or mode (nominal class) of the class attribute.
        m_fMean = data.meanOrMode(data.classIndex());
      }

      public double classifyInstance(Instance instance) throws Exception {
        return m_fMean;
      }
    }

Figure 2: Classifier code example.
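As an illustrative sketch of these two mechanisms (the class and its seed option are hypothetical; the OptionHandler interface, the Option and Utils helper classes, and the tip-text naming convention are assumed from WEKA's core API):

    import java.util.Enumeration;
    import java.util.Vector;
    import weka.core.Option;
    import weka.core.OptionHandler;
    import weka.core.Utils;

    // Hypothetical component exposing one command-line option, -S <seed>.
    public class SeededComponent implements OptionHandler {

      private int m_Seed = 1;

      // Describe the available options for help listings.
      public Enumeration listOptions() {
        Vector options = new Vector();
        options.addElement(new Option("\tRandom seed (default 1).", "S", 1, "-S <seed>"));
        return options.elements();
      }

      // Parse option values from a command-line style string array.
      public void setOptions(String[] options) throws Exception {
        String seed = Utils.getOption('S', options);
        if (seed.length() > 0) {
          m_Seed = Integer.parseInt(seed);
        }
      }

      // Report the current settings in the same format.
      public String[] getOptions() {
        return new String[] { "-S", "" + m_Seed };
      }

      // Bean-style accessors make 'seed' editable in the graphical interfaces.
      public int getSeed() { return m_Seed; }
      public void setSeed(int seed) { m_Seed = seed; }

      // Tool-tip text for the 'seed' property, picked up by the GUI help dialogs.
      public String seedTipText() {
        return "The random seed used by this component.";
      }
    }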
Some methods only apply to certain kinds of data, such as numeric class values or discrete attribute values. A ‘capabilities’ mechanism allows classes to identify what kind of data is acceptable to any given method, and the graphical user interfaces incorporate this by making methods available only if they are able to process the data at hand.
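As a hedged sketch (the classifier is hypothetical; the Capabilities API is assumed from WEKA's core classes), a method declares what it accepts by overriding getCapabilities():

    import weka.classifiers.Classifier;
    import weka.core.Capabilities;
    import weka.core.Capabilities.Capability;
    import weka.core.Instance;
    import weka.core.Instances;

    // Hypothetical classifier that declares it handles numeric data only.
    public class NumericOnlyClassifier extends Classifier {

      public Capabilities getCapabilities() {
        Capabilities result = super.getCapabilities();
        result.disableAll();
        result.enable(Capability.NUMERIC_ATTRIBUTES); // numeric inputs only
        result.enable(Capability.NUMERIC_CLASS);      // numeric class value
        return result;
      }

      public void buildClassifier(Instances data) throws Exception {
        getCapabilities().testWithFail(data); // reject incompatible data sets
        // ... training code would go here ...
      }

      public double classifyInstance(Instance instance) throws Exception {
        return 0.0; // placeholder prediction
      }
    }

The graphical interfaces query getCapabilities() against the loaded data and make a method selectable only if its declared capabilities match.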
Origins

The Machine Learning project at Waikato was launched in 1993 with a successful grant application to the New Zealand Foundation for Research, Science, and Technology. Machine learning was selected because of prior expertise and its potential applicability to agriculture, New Zealand's core industry; the grant was justified in terms of applications research rather than the development of new learning techniques. This gave the research team a license to incorporate and reimplement existing methods, and work soon began on a workbench, written in C, that was intended to provide a common interface to a growing collection of machine learning algorithms. It contained some learning algorithms written mostly in C, data I/O and pre-processing tools, also written in C, and graphical user interfaces written in TCL/TK. The acronym WEKA, for "Waikato Environment for Knowledge Analysis", was coined, and the system gradually became known in the international ML community, along with another machine learning library in C++, called MLC++, developed by Kohavi et al. at Stanford University.
Because of dependencies on other libraries, mainly related to the graphical user interfaces, the software became increasingly unwieldy and hard to maintain. In 1997, work began on reimplementing WEKA from scratch in Java, into what we now term WEKA 3. One of the authors, Eibe Frank, had earlier decided to adopt Java to implement algorithm prototypes, abandoning C++ because Java development was rapid and debugging was dramatically simplified. This positive experience, the promise of platform independence through virtual machine technology, and the fact that, as part of the research code, classes for reading WEKA's standard ARFF file format into appropriate data structures already existed, led to the decision to re-write WEKA in Java.
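For readers unfamiliar with the format, a minimal ARFF file looks roughly like this (contents illustrative, not taken from the paper):

    % Header declares the relation and its attributes; @data lists instances.
    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute play {yes, no}
    @data
    sunny, 85, no
    overcast, 83, yes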
Many of the classes in today's code archive (including, for example, J48, the WEKA implementation of Quinlan's C4.5) date from this period. Rapid early development was stimulated by the need to teach a course on machine learning at the University of Calgary during a Fall 1997 sabbatical visit by Witten, along with several students, including Frank. By the end of 1998, WEKA included packages for classifiers, association rule learners, filters, and evaluation, as well as a core package.
Work began on the first edition of the Data Mining book in 1997, based on earlier notes for Witten's courses at the University of Waikato, and a proposal was submitted to Morgan Kaufmann late that year. It finally appeared in 1999 (though, for reasons that are not clear to us, the official publication date is 2000). The WEKA software described in that edition was command-line oriented, and the book makes no mention of a graphical user interface, for which design began in 1999. By the time the second edition appeared in 2005, the interactive versions of WEKA—the Explorer, Experimenter, and Knowledge Flow interface—were mature and well-tested pieces of software.
The size of the mailing list, the volume of downloads, and the number of academic papers citing WEKA-based results show that the software is widely deployed. It is used in the machine learning and data mining community as an educational tool for teaching both applications and technical internals of machine learning algorithms, and as a research tool for developing and empirically comparing new techniques. It is applied increasingly widely in other academic fields, and in commercial settings. However, there are several other factors, many of which are exposed by the above brief historical review.
The lack of dependency on externally-maintained libraries makes maintenance of the code base much easier. For one thing, portability seemed less important in the days when everyone we knew was using Unix! Later versions of the Java Virtual Machine virtually eliminated the performance gap with C++ through just-in-time compilation and adaptive optimization. Although a perception persists that execution of Java code is too slow for scientific computing, this is not our experience, and it does not appear to be shared by the WEKA community.
Graphical User Interface

Early releases of the WEKA 3 software were command-line driven and did not include graphical user interfaces. Although many experienced users still shun them, the graphical interfaces undoubtedly contributed to the popularity of WEKA. The introduction of the Explorer in particular has made the software more accessible to users who want to employ machine learning in practice. It has allowed many universities (including our own) to offer courses in applied machine learning and data mining, and has certainly contributed to WEKA's popularity for commercial and industrial use. Again, while obvious in hindsight, the development of graphical user interfaces was a significant risk, because valuable programming effort had to be diverted from the main job of implementing learning algorithms into relatively superficial areas of presentation and interaction. The WEKA 3 graphical user interface development benefited from the fact that the core system was already up and running, and relatively mature—as evidenced by the first edition of Data Mining—before any work began on interactive interfaces. The early WEKA project probably suffered from attempting to develop interactive interfaces (in TCL/TK) at the same time as the basic algorithms and data structures were being commissioned, a mistake that was avoided in the later system.
There has been a symbiotic relationship between the software and the book: users of the software are likely to consult the book, and readers of the book are likely to try out the software. The combination of a book explaining the core algorithms with a corresponding piece of free software is particularly suitable for education. It seems likely that the feedback loop between the readership of the book and the users of the software has bolstered the size of both populations. The early existence of a companion book, unusual for open source software, is particularly valuable for machine learning because the techniques involved are quite simple and easy to explain, but by no means obvious.
This, in conjunction with the information in the mailing list archives, provides a wealth of information for all users. The existence of a steadily increasing, knowledgeable and enthusiastic user community, and the combined knowledge they share, has played a significant role in the success of the software.
Comprehensiveness

Perhaps the foremost reason for the adoption of WEKA 3 by the research community has been the inclusion of faithful reimplementations of classic benchmark methods, in particular the C4.5 algorithm. The original implementations of these algorithms were already very successful software projects in themselves.
We were lucky to receive an initial research grant for applied machine learning research from a New Zealand funding agency that approved of our aspirations to investigate the application of this technology in agricultural domains. We continued to apply for, and receive, follow-on funding from the same source, but—particularly as time went on—this compelled us to channel much of our research in the direction of target applications rather than basic research in machine learning.
Maintaining the Project

A software project can only become and remain successful if it is consistently maintained to a high standard. It has been our experience that this requires a group of people who are continually involved in the management and development of the software for an extended period of time, spanning several years. The core development team of WEKA has always been small and close-knit: having a small team helps maintain code quality and overall coherence. Over the years, work on the project has been done by a couple of academic staff, who were involved in the longer term and fitted it in with their teaching and research duties, and a succession of one (or one and a half) full-time-equivalent research programmers. A fair amount of work was undertaken by students on casual contracts or as part of their studies. The community also contributed many algorithm implementations that are included in the WEKA distribution, along with some patches for the basic framework. The project has always had a policy of maintaining close control of what became part of the software. Only a handful of developers have ever had write access to the source code repository. The drawback of this policy is reduced functionality; the advantages are improved code quality, ease of maintenance, and greater coherence for both developer and end user. When new algorithm implementations were considered for inclusion, we generally insisted on a backing publication describing the new method.
The research contract that sponsored WEKA development required some measure of commercialization, and a few commercial licenses to parts of the WEKA code base owned by the University of Waikato have been sold. It eventually became clear that the succession of research contracts had a finite life span, and that support by a commercial organization was necessary to keep WEKA healthy. Since 2007, Pentaho Corporation, a company that provides open-source business intelligence software and support, has contributed substantially to the maintenance of WEKA by hiring one of the chief developers and providing online help. As part of the requirement to commercialize the software, it has been necessary to maintain a branch in the source code repository that contains only code owned by the University of Waikato, an onerous but necessary facet of project maintenance.
Concluding Remarks

Obviously, in almost two decades of project development, many mistakes were made—but most were quickly corrected. One, mentioned above, regards the premature design of interactive interfaces, where WEKA 3 benefited from a strategic error made in the early WEKA project. Below are two instances of how adoption might have been strengthened had the project been managed differently.

One of the most challenging aspects of managing open-source software development is deciding what to include in the software. One solution is a plug-in architecture: under such a scheme, packages maintained by their developers can be loaded into the system on demand, opening it up to greater diversity and flexibility. A recent development in WEKA is the inclusion of package management, so that packages can easily be added to a given installation. The project would probably have benefited by moving in this direction earlier.
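As a hedged illustration (the WekaPackageManager class and its flags are assumptions about the command-line package manager that shipped with this feature, and the package name is a placeholder), managing packages from the shell might look like:

    java weka.core.WekaPackageManager -list-packages all
    java weka.core.WekaPackageManager -install-package somePackage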
We have learned that mailing lists for open-source software are easier to maintain if the users are researchers rather than teachers. Requests from students all over the world for assistance with their assignments and projects present a significant (and growing) problem; moreover, students often depart from proper mailing-list etiquette. It would have been better to identify the clientele for WEKA as a teaching tool, and to offer a one-stop shop for software, documentation and help that is distinct from the support infrastructure used by researchers. One of the most satisfying aspects of participating in the project is that the software has been incorporated into, and has spawned, many other open-source projects.
Acknowledgments

We gratefully acknowledge the input of all contributors to the WEKA project, and want in particular to mention significant early contributions to the WEKA 3 codebase by Len Trigg and many later contributions by Richard Kirkby, Ashraf Kibriya and Xin Xu.
References

R. Kohavi et al. Data mining using MLC++: a machine learning library in C++.