jmlr jmlr2009 jmlr2009-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas Abeel, Yves Van de Peer, Yvan Saeys
Abstract: Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license. Keywords: open source, machine learning, data mining, java library, clustering, feature selection, classification
Reference: text
sentIndex sentText sentNum sentScore
1 The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. [sent-11, score-0.248]
2 Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. [sent-12, score-0.378]
3 The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. [sent-13, score-0.096]
4 The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license. [sent-14, score-0.254]
5 Keywords: open source, machine learning, data mining, java library, clustering, feature selection, classification [sent-17, score-0.156]
6 1. Introduction. Machine learning techniques are increasingly popular in research fields like bio- and chemoinformatics, text and web mining, as well as in many other areas of research and industry. [sent-18, score-0.035]
7 In this paper we present Java-ML: a cross-platform, open source machine learning library written in Java. [sent-19, score-0.386]
8 Several well-known data mining libraries already exist, including, for example, Weka (Witten and Frank, 2005) and Yale/RapidMiner (Mierswa et al., 2006). [sent-20, score-0.106]
9 These programs provide a user-friendly interface and are geared towards interactive use by the user. [sent-22, score-0.217]
10 In contrast to these programs, Java-ML is oriented towards developers who want to use machine learning in their own programs. [sent-23, score-0.145]
11 To this end, Java-ML interfaces are restricted to the essentials, and are very easy to understand. [sent-24, score-0.248]
12 As a result, Java-ML facilitates a broad exploration of different models, is straightforward to integrate into your own source code, and can be easily extended. [sent-25, score-0.311]
13 Java-ML contains an extensive set of similarity-based techniques, and offers state-of-the-art feature selection techniques. [sent-27, score-0.176]
14 The large number of similarity functions allows for a broad set of clustering and instance-based learning techniques, while the feature selection techniques are well suited to high-dimensional domains, such as those often encountered in bioinformatics and biomedical applications. [sent-28, score-0.676]
15 2. Description of the Library. In this section we first describe the software design of Java-ML, then discuss how to integrate it into your own program, and finally cover the documentation. [sent-33, score-0.121]
16 2.1 Structure of the Library. The library is built around two core interfaces: Dataset and Instance. [sent-35, score-0.254]
17 These two interfaces have several implementations for different types of samples. [sent-36, score-0.29]
18 The machine learning algorithms implement one of the following interfaces: Clusterer, Classifier, FeatureScoring, FeatureRanking or FeatureSubsetSelection. [sent-37, score-0.049]
19 Distance, correlation and similarity measures implement the interface DistanceMeasure. [sent-38, score-0.352]
20 These distance measures can be used in many algorithms to modify their behavior. [sent-39, score-0.157]
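A rough illustration of this (hedged: the DenseInstance and EuclideanDistance class names, the package paths and the measure() signature are assumptions about the Java-ML API, not code taken from this paper):

    import net.sf.javaml.core.DenseInstance;
    import net.sf.javaml.core.Instance;
    import net.sf.javaml.distance.DistanceMeasure;
    import net.sf.javaml.distance.EuclideanDistance;

    // Two small, hand-made instances.
    Instance a = new DenseInstance(new double[] { 1.0, 2.0, 3.0 });
    Instance b = new DenseInstance(new double[] { 2.0, 4.0, 6.0 });
    // Every DistanceMeasure implementation exposes the same measure() call,
    // so it can be swapped into any algorithm that accepts a DistanceMeasure.
    DistanceMeasure dm = new EuclideanDistance();
    double d = dm.measure(a, b);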
21 Cluster evaluation measures are defined by the ClusterEvaluation interface. [sent-40, score-0.145]
22 Manipulation filters implement either InstanceFilter or DatasetFilter, depending on the level at which they operate. [sent-41, score-0.049]
23 All implementing classes for each of these interfaces are listed in the API documentation on the Java-ML website. [sent-42, score-0.343]
24 Each of these interfaces provides one or two methods that are required to execute the algorithm on a particular data set. [sent-43, score-0.287]
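As an illustrative sketch (not the verbatim Java-ML declaration), the clustering interface described here could look roughly like:

    // Sketch only; the actual interface in the library may differ in detail.
    public interface Clusterer {
        // Run the algorithm on a data set and return the resulting clusters,
        // each represented as a data set of its own.
        Dataset[] cluster(Dataset data);
    }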
25 Several utility classes make it easy to load data from tab- or comma-separated files and from ARFF-formatted files. [sent-44, score-0.175]
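A hedged sketch of such loading calls (the FileHandler and ARFFHandler names, their package paths and the argument order are assumptions based on the description above):

    import java.io.File;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.tools.data.ARFFHandler;
    import net.sf.javaml.tools.data.FileHandler;

    // Comma-separated file, class label in column 4 (zero-based index).
    Dataset csv = FileHandler.loadDataset(new File("iris.data"), 4, ",");
    // Tab-separated file, class label in the first column.
    Dataset tsv = FileHandler.loadDataset(new File("data.tsv"), 0, "\t");
    // ARFF-formatted file, class label in column 4.
    Dataset arff = ARFFHandler.loadARFF(new File("iris.arff"), 4);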
26 An overview of the main algorithms included in Java-ML can be found in Table 1. [sent-45, score-0.043]
27 The library provides several algorithms that have not been made available before in a bundled form. [sent-46, score-0.254]
28 In particular, clustering algorithms and the accompanying cluster evaluation measures are extensively represented. [sent-47, score-0.617]
29 This includes the adaptive quality-based clustering algorithm, density-based methods, self-organizing maps (both as a clustering and a classification algorithm) and numerous other well-known clustering algorithms. [sent-48, score-1.001]
30 A large number of distance, similarity and correlation measures are included. [sent-49, score-0.24]
31 Feature selection algorithms include traditional methods such as symmetrical uncertainty, gain ratio, RELIEF and stepwise addition/removal, as well as a number of more recent methods (SVM-RFE and random forest attribute evaluation). [sent-50, score-0.26]
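A hypothetical usage sketch for one of these scorers (the GainRatio class, its package, the build()/score() methods and Dataset.noAttributes() are all assumptions about the API):

    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.featureselection.scoring.GainRatio;

    // Score every attribute of a previously loaded data set.
    GainRatio gainRatio = new GainRatio();
    gainRatio.build(data);
    for (int i = 0; i < data.noAttributes(); i++) {
        System.out.println("attribute " + i + ": " + gainRatio.score(i));
    }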
32 The recently introduced concept of ensemble feature selection techniques (Saeys et al., 2008) is also included. [sent-51, score-0.162]
33 We have also implemented a fast and simple random tree algorithm to cope with high-dimensional, sparse and ambiguous data. [sent-53, score-0.131]
34 Finally, we provide bridges for classification and clustering in Weka and libsvm (Fan et al., 2005). [sent-54, score-0.476]
35 2.2 Easy Integration in Your Own Source Code. Including Java-ML algorithms in your own source code is very simple. [sent-57, score-0.222]
36 To illustrate this, we present two short code fragments that demonstrate how easy it is to integrate the library. [sent-58, score-0.257]
37 The following lines of code integrate a K-Means clustering algorithm into your own program. [sent-59, score-0.533]
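The code itself is not preserved in this extract; a minimal reconstruction based on the description in the following sentences might look like this (package paths and the loadDataset argument order are assumptions):

    import java.io.File;
    import net.sf.javaml.clustering.Clusterer;
    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.tools.data.FileHandler;

    // Load the iris data: class label in column 4 (zero-based), comma-separated fields.
    Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
    // K-means clustering with default settings (k = 4 by default).
    Clusterer km = new KMeans();
    // Cluster the data; the result is returned as an array of data sets.
    Dataset[] clusters = km.cluster(data);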
38 The first line uses the FileHandler utility to load data from the iris.data file. [sent-63, score-0.124]
39 In this file, the class label is on the fourth position and the fields are separated by a comma. [sent-65, score-0.051]
40 The second line constructs a new instance of the KMeans clustering algorithm with default values, in this case k=4. [sent-66, score-0.365]
41 The third line uses the KMeans instance to cluster the data that we loaded in the first line. [sent-67, score-0.193]
42 The resulting clusters will be returned as an array of data sets. [sent-68, score-0.082]
43 The following example illustrates how to perform a cross-validation experiment for a specific dataset and classifier. [sent-69, score-0.163]
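The corresponding code is cut off in this extract; a sketch of how such an experiment might be written follows (the KNearestNeighbors, CrossValidation and PerformanceMeasure names, their packages and signatures, and the choice of a 5-nearest-neighbour classifier are assumptions):

    import java.io.File;
    import java.util.Map;
    import net.sf.javaml.classification.Classifier;
    import net.sf.javaml.classification.KNearestNeighbors;
    import net.sf.javaml.classification.evaluation.CrossValidation;
    import net.sf.javaml.classification.evaluation.PerformanceMeasure;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.tools.data.FileHandler;

    // Load the data as in the previous example.
    Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
    // A k-nearest-neighbours classifier with k = 5.
    Classifier knn = new KNearestNeighbors(5);
    // Run a cross-validation experiment and collect one performance measure per class value.
    CrossValidation cv = new CrossValidation(knn);
    Map<Object, PerformanceMeasure> results = cv.crossValidation(data);
    System.out.println(results);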
wordName wordTfidf (topN-words)
[('clustering', 0.322), ('library', 0.254), ('interfaces', 0.248), ('abeel', 0.213), ('psb', 0.213), ('yvan', 0.213), ('yves', 0.213), ('saeys', 0.181), ('dataset', 0.163), ('kmeans', 0.162), ('weka', 0.162), ('classifier', 0.142), ('clusterer', 0.142), ('file', 0.142), ('peer', 0.142), ('source', 0.132), ('integrate', 0.121), ('arff', 0.121), ('java', 0.111), ('bridges', 0.108), ('api', 0.108), ('forests', 0.108), ('stepwise', 0.108), ('measures', 0.105), ('developers', 0.099), ('cluster', 0.096), ('knn', 0.092), ('code', 0.09), ('manipulation', 0.087), ('lters', 0.087), ('similarity', 0.085), ('crossvalidation', 0.082), ('thomas', 0.082), ('ensemble', 0.071), ('interface', 0.063), ('utility', 0.063), ('programs', 0.061), ('load', 0.061), ('eer', 0.06), ('gent', 0.06), ('rfe', 0.06), ('chemoinformatics', 0.06), ('symmetrical', 0.06), ('mining', 0.06), ('broad', 0.058), ('implementing', 0.056), ('loaded', 0.054), ('accompanying', 0.054), ('extensible', 0.054), ('geared', 0.054), ('gnu', 0.054), ('plant', 0.054), ('ghent', 0.054), ('relief', 0.054), ('usable', 0.054), ('documented', 0.054), ('distance', 0.052), ('separated', 0.051), ('van', 0.05), ('correlation', 0.05), ('utilities', 0.05), ('gpl', 0.05), ('ambiguous', 0.05), ('achine', 0.05), ('implement', 0.049), ('clusters', 0.048), ('selection', 0.046), ('oriented', 0.046), ('forest', 0.046), ('fragments', 0.046), ('libraries', 0.046), ('organizing', 0.046), ('libsvm', 0.046), ('feature', 0.045), ('cope', 0.043), ('bagging', 0.043), ('biomedical', 0.043), ('ren', 0.043), ('self', 0.043), ('witten', 0.043), ('belgium', 0.043), ('instance', 0.043), ('overview', 0.043), ('implementations', 0.042), ('loading', 0.041), ('elds', 0.04), ('evaluation', 0.04), ('interactive', 0.039), ('documentation', 0.039), ('execute', 0.039), ('tree', 0.038), ('km', 0.037), ('sonnenburg', 0.037), ('discretization', 0.035), ('integration', 0.035), ('fan', 0.035), ('increasingly', 0.035), ('maps', 0.035), ('encountered', 0.034), ('array', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
Author: Thomas Abeel, Yves Van de Peer, Yvan Saeys
Abstract: Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license. Keywords: open source, machine learning, data mining, java library, clustering, feature selection, classification
2 0.18849146 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
Author: Davis E. King
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools. Keywords: kernel-methods, svm, rvm, kernel clustering, C++, Bayesian networks
Author: Troy Raeder, Nitesh V. Chawla
Abstract: This paper presents Model Monitor (M²), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M² provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M² is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis
4 0.096136995 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
5 0.086717896 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
6 0.081702597 20 jmlr-2009-DL-Learner: Learning Concepts in Description Logics
8 0.059194516 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
9 0.054964595 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
10 0.051679641 90 jmlr-2009-Structure Spaces
12 0.047823407 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
13 0.043769479 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
15 0.040240329 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
16 0.034795702 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
17 0.034464356 38 jmlr-2009-Hash Kernels for Structured Data
18 0.034001671 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
19 0.033939503 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
20 0.033895116 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
topicId topicWeight
[(0, 0.14), (1, -0.142), (2, 0.075), (3, -0.103), (4, 0.082), (5, -0.261), (6, 0.275), (7, 0.061), (8, -0.002), (9, 0.186), (10, -0.129), (11, 0.252), (12, -0.11), (13, 0.298), (14, 0.086), (15, -0.026), (16, 0.082), (17, -0.058), (18, -0.114), (19, 0.002), (20, 0.016), (21, 0.036), (22, -0.137), (23, 0.049), (24, 0.005), (25, 0.02), (26, -0.041), (27, 0.053), (28, -0.039), (29, -0.086), (30, 0.098), (31, -0.076), (32, 0.088), (33, -0.049), (34, 0.012), (35, -0.014), (36, -0.003), (37, -0.057), (38, 0.065), (39, 0.044), (40, -0.065), (41, 0.064), (42, 0.023), (43, -0.034), (44, -0.065), (45, 0.048), (46, -0.007), (47, -0.095), (48, 0.019), (49, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.98280835 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
Author: Thomas Abeel, Yves Van de Peer, Yvan Saeys
Abstract: Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license. Keywords: open source, machine learning, data mining, java library, clustering, feature selection, classification
2 0.74851376 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
Author: Davis E. King
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools. Keywords: kernel-methods, svm, rvm, kernel clustering, C++, Bayesian networks
Author: Troy Raeder, Nitesh V. Chawla
Abstract: This paper presents Model Monitor (M²), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M² provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M² is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
5 0.37213901 20 jmlr-2009-DL-Learner: Learning Concepts in Description Logics
Author: Jens Lehmann
Abstract: In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems. It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service. Keywords: concept learning, description logics, OWL, classification, open-source
8 0.2520242 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
9 0.25126639 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
11 0.21765172 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
12 0.20753177 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
13 0.20672631 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
14 0.1855083 90 jmlr-2009-Structure Spaces
15 0.17557263 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification
16 0.17215022 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
17 0.16802342 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
18 0.15835808 50 jmlr-2009-Learning When Concepts Abound
19 0.1564313 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
20 0.14433429 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
topicId topicWeight
[(8, 0.053), (26, 0.038), (38, 0.038), (52, 0.029), (55, 0.012), (58, 0.026), (66, 0.053), (90, 0.042), (91, 0.601)]
simIndex simValue paperId paperTitle
same-paper 1 0.78745914 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
Author: Thomas Abeel, Yves Van de Peer, Yvan Saeys
Abstract: Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license. Keywords: open source, machine learning, data mining, java library, clustering, feature selection, classification
2 0.27462727 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
Author: Davis E. King
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools. Keywords: kernel-methods, svm, rvm, kernel clustering, C++, Bayesian networks
Author: Abhik Shah, Peter Woolf
Abstract: In this paper, we introduce PEBL, a Python library and application for learning Bayesian network structure from data and prior knowledge that provides features unmatched by alternative software packages: the ability to use interventional data, flexible specification of structural priors, modeling with hidden variables and exploitation of parallel processing. PEBL is released under the MIT open-source license, can be installed from the Python Package Index and is available at http://pebl-project.googlecode.com. Keywords: Bayesian networks, python, open source software
4 0.16865599 20 jmlr-2009-DL-Learner: Learning Concepts in Description Logics
Author: Jens Lehmann
Abstract: In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems. It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service. Keywords: concept learning, description logics, OWL, classification, open-source
5 0.15400271 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
Author: Sylvain Arlot, Pascal Massart
Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram
6 0.1448791 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
7 0.14040893 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
8 0.13875192 29 jmlr-2009-Estimating Labels from Label Proportions
9 0.13650073 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
10 0.13635197 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
11 0.13550481 48 jmlr-2009-Learning Nondeterministic Classifiers
13 0.13270676 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
14 0.13147403 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
15 0.13143618 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
16 0.13102798 38 jmlr-2009-Hash Kernels for Structured Data
17 0.13079953 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
18 0.1305085 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
20 0.12936778 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning