jmlr jmlr2013 jmlr2013-83 knowledge-graph by maker-knowledge-mining

83 jmlr-2013-Orange: Data Mining Toolbox in Python


Source: pdf

Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan

Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally intensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia. Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. [sent-49, score-0.268]

2 Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. [sent-50, score-0.363]

3 Orange is intended both for experienced users and programmers, as well as for students of data mining. [sent-52, score-0.025]

4 Keywords: Python, data mining, machine learning, toolbox, scripting [sent-53, score-0.226]

5 Within the context of explorative data analysis, they offer advantages like interactivity and fast prototyping by gluing together existing components or adapting them for new tasks. [sent-55, score-0.086]

6 Python is a scripting language with clear and simple syntax, which also made it popular in education. [sent-56, score-0.226]

7 Its relatively slow execution can be circumvented by using libraries that implement the computationally intensive tasks in low-level languages. [sent-57, score-0.085]

8 Many are related to machine learning, including several general packages like scikit-learn (Pedregosa et al., 2011). [sent-59, score-0.028]

9 It focuses on simplicity, interactivity through scripting, and component-based design. [sent-64, score-0.043]

10 Toolbox Overview. The Orange library is a hierarchically-organized toolbox of data mining components. [sent-67, score-0.129]

11 The low-level procedures at the bottom of the hierarchy, like data filtering, probability assessment and feature scoring, are assembled into higher-level algorithms, such as classification tree learning. [sent-68, score-0.029]

12 The library is designed to simplify the assembly of data analysis workflows and crafting of data mining approaches from a combination of existing components. [sent-71, score-0.131]
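To make this component-based assembly concrete, here is a minimal sketch (not taken from the paper) of how a low-level component, information-gain feature scoring, sits next to a higher-level one, classification tree learning. It assumes the Orange 2.x scripting API under Python 2 and assumes the "voting" data set ships with Orange; treat the exact names as illustrative.

    import Orange

    # Load a bundled data set (assumption: "voting" ships with Orange 2.x).
    data = Orange.data.Table("voting")

    # Low-level component: score every feature by information gain.
    gain = Orange.feature.scoring.InfoGain()
    for feature in data.domain.features:
        print feature.name, gain(feature, data)

    # Higher-level component assembled from such pieces: a classification tree.
    tree = Orange.classification.tree.TreeLearner(data)
    print tree(data[0])  # predicted class of the first instance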

13 The Orange scripting library is also the foundation of its visual programming platform, with graphical user interface components for interactive data visualization. [sent-73, score-0.31]

14 The two major packages that are similar to Orange and are still actively developed are scikit-learn (Pedregosa et al., 2011) and mlpy (Albanese et al., 2012). [sent-74, score-0.028]

15 Both are more tightly integrated with numpy and at present blend better into Python’s numerical computing habitat. [sent-77, score-0.065]

16 Orange, on the other hand, was inspired by classical machine learning, which focuses on symbolic methods. [sent-78, score-0.033]

17 These features also make Orange more suitable for interactive, explorative data analysis. [sent-82, score-0.043]

18 We first read the data on survival of 2,201 passengers from HMS Titanic and construct a set of learning algorithms: a naive Bayesian and SVM learner, and a stacked combination of the two (Wolpert, 1992). [sent-104, score-0.086]
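Read together with the code fragments in sentences 19 and 20 below, this corresponds to a script along the following lines. This is a hedged reconstruction, not a quote from the paper: it assumes the Orange 2.x API under Python 2, and in particular the module path Orange.ensemble.stacking.StackedClassificationLearner for the stacking learner is our assumption.

    import Orange

    # Survival data for the 2,201 Titanic passengers, bundled with Orange.
    data = Orange.data.Table("titanic")

    # A naive Bayesian learner, an SVM learner, and (assumed module path)
    # a stacked combination of the two.
    bayes = Orange.classification.bayes.NaiveLearner(name="bayes")
    svm = Orange.classification.svm.SVMLearner(name="svm")
    stack = Orange.ensemble.stacking.StackedClassificationLearner(
        [bayes, svm], name="stack")
    learners = [bayes, svm, stack]

    # Cross-validate on all passengers, then on the ~470 female passengers.
    res = Orange.evaluation.testing.cross_validation(learners, data, folds=10)
    print Orange.evaluation.scoring.AUC(res)

    females = Orange.data.Table([d for d in data if d["sex"] == "female"])
    res = Orange.evaluation.testing.cross_validation(learners, females, folds=10)
    print Orange.evaluation.scoring.AUC(res)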

19 Running stacking on the subset of about 470 female passengers improves the AUC score: >>> females = Orange.data.Table([d for d in data if d["sex"]=="female"]) [sent-106, score-0.218]

20 >>> len(females) 470 >>> res = Orange. [sent-108, score-0.116]

21 The following example defines a new learner that encloses another learner in a feature selection wrapper: it sorts the features by their information gain (as implemented in Orange.feature.scoring.InfoGain), [sent-117, score-0.064]

22 constructs a new data set with only the m best features, and calls the base learner. [sent-120, score-0.024]

23 self.base_learner = base_learner def __call__(self, data, weights=None): gain = Orange.feature.scoring.InfoGain() [sent-125, score-0.031]
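A runnable version of the wrapper sketched in sentences 21-23 could look as follows. This is a minimal reconstruction assuming the Orange 2.x API (Orange.feature.scoring.InfoGain, Orange.data.Domain, Orange.data.Table, Orange.data.Instance); the FSSClassifier helper is our hypothetical addition so that the trained model can classify instances from the original domain.

    import Orange

    class FSSClassifier:
        # Hypothetical helper: projects instances into the reduced domain
        # before delegating to the wrapped classifier.
        def __init__(self, classifier, domain):
            self.classifier = classifier
            self.domain = domain

        def __call__(self, instance):
            return self.classifier(Orange.data.Instance(self.domain, instance))

    class FSSLearner:
        # Wraps base_learner in information-gain feature subset selection.
        def __init__(self, base_learner, m=5):
            self.base_learner = base_learner
            self.m = m

        def __call__(self, data, weights=None):
            gain = Orange.feature.scoring.InfoGain()
            # Sort the features by information gain and keep the m best.
            best = sorted(data.domain.features,
                          key=lambda f: gain(f, data), reverse=True)[:self.m]
            domain = Orange.data.Domain(best + [data.domain.class_var])
            # Project the data onto the selected features, then train.
            model = self.base_learner(Orange.data.Table(domain, data))
            return FSSClassifier(model, domain)

A call such as FSSLearner(Orange.classification.bayes.NaiveLearner(), m=3)(data) would then return a classifier restricted to the three most informative features.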

24 Code Design. Orange’s core is a collection of nearly 200 C++ classes that cover the basic data structures and the majority of preprocessing and modeling algorithms. [sent-157, score-0.034]

25 The C++ part is self-contained, without any calls to Python that would induce unnecessary overhead. [sent-158, score-0.024]

26 The core includes several open source libraries, including LIBSVM (Chang and Lin, 2011) and LIBLINEAR (Fan et al., 2008) [sent-159, score-0.034]

27 The Python layer also uses popular Python libraries: numpy for linear algebra, networkx (Hagberg et al., 2008) for working with networks, [sent-167, score-0.176]

28 and matplotlib (Hunter, 2007) for basic visualization. [sent-168, score-0.05]

29 The upper layer of Orange is written in Python and includes procedures that are not time-critical. [sent-169, score-0.055]

30 This is also the place at which users outside the core development group most easily contribute to the project. [sent-170, score-0.059]

31 The code is hosted in a Bitbucket repository (https://bitbucket. [sent-175, score-0.032]

32 Orange runs on Windows, Mac OS X and Linux, and can also be installed from the Python Package Index repository (pip install Orange). [sent-177, score-0.032]

33 There, we will switch to numpy-based data structures and scrap the C++ core in favor of using routines from numpy and scipy (Jones et al., 2011) [sent-185, score-0.142]

34 and similar libraries that did not exist when Orange was first conceived. [sent-187, score-0.085]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('orange', 0.43), ('fri', 0.402), ('lj', 0.402), ('python', 0.256), ('scripting', 0.226), ('uni', 0.139), ('si', 0.12), ('res', 0.116), ('lan', 0.107), ('len', 0.101), ('toma', 0.101), ('pedregosa', 0.086), ('libraries', 0.085), ('albanese', 0.075), ('curk', 0.075), ('females', 0.075), ('gorup', 0.075), ('janez', 0.075), ('jure', 0.075), ('marinka', 0.075), ('marko', 0.075), ('matija', 0.075), ('miha', 0.075), ('mitar', 0.075), ('polajnar', 0.075), ('stajdohar', 0.075), ('toplak', 0.075), ('umek', 0.075), ('zagar', 0.075), ('zbontar', 0.075), ('zitnik', 0.075), ('zupan', 0.075), ('mlpy', 0.065), ('numpy', 0.065), ('ar', 0.054), ('ale', 0.05), ('assembly', 0.05), ('bla', 0.05), ('blackford', 0.05), ('crt', 0.05), ('erjavec', 0.05), ('evar', 0.05), ('fsslearner', 0.05), ('hagberg', 0.05), ('ina', 0.05), ('ining', 0.05), ('matplotlib', 0.05), ('milutinovi', 0.05), ('nbc', 0.05), ('oolbox', 0.05), ('passengers', 0.05), ('rjavec', 0.05), ('schaul', 0.05), ('stacking', 0.05), ('stari', 0.05), ('tomaz', 0.05), ('urk', 0.05), ('ython', 0.05), ('toolbox', 0.048), ('stack', 0.047), ('interactive', 0.045), ('explorative', 0.043), ('ljubljana', 0.043), ('female', 0.043), ('interactivity', 0.043), ('blas', 0.043), ('scipy', 0.043), ('mining', 0.042), ('library', 0.039), ('forests', 0.039), ('mo', 0.039), ('titanic', 0.039), ('chang', 0.036), ('dem', 0.036), ('hierarchy', 0.036), ('ho', 0.036), ('stacked', 0.036), ('core', 0.034), ('libsvm', 0.034), ('scoring', 0.034), ('documentation', 0.033), ('symbolic', 0.033), ('os', 0.033), ('liblinear', 0.033), ('martin', 0.033), ('learner', 0.032), ('repository', 0.032), ('def', 0.031), ('mac', 0.031), ('barber', 0.031), ('windows', 0.03), ('self', 0.03), ('procedures', 0.029), ('packages', 0.028), ('jones', 0.027), ('layer', 0.026), ('users', 0.025), ('calls', 0.024), ('al', 0.023), ('trees', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 83 jmlr-2013-Orange: Data Mining Toolbox in Python

Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan

Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally intensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting

2 0.054803066 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia

Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced Barnes-Hut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software

3 0.04012033 90 jmlr-2013-Quasi-Newton Method: A New Direction

Author: Philipp Hennig, Martin Kiefel

Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes

4 0.035386525 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

5 0.035219815 95 jmlr-2013-Ranking Forests

Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis

Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization

6 0.033119425 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

7 0.031940684 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

8 0.030575514 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

9 0.024634225 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

10 0.022952773 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules

11 0.020331204 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

12 0.019743735 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

13 0.018998887 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

14 0.017786562 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

15 0.016003553 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

16 0.014674729 22 jmlr-2013-Classifying With Confidence From Incomplete Information

17 0.014516242 8 jmlr-2013-A Theory of Multiclass Boosting

18 0.014375556 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

19 0.014072549 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

20 0.013804732 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.07), (1, 0.003), (2, -0.029), (3, -0.025), (4, 0.014), (5, 0.025), (6, -0.009), (7, 0.0), (8, -0.051), (9, -0.073), (10, 0.082), (11, -0.177), (12, 0.057), (13, -0.082), (14, 0.012), (15, -0.177), (16, -0.013), (17, -0.073), (18, 0.047), (19, -0.004), (20, 0.157), (21, -0.026), (22, -0.013), (23, -0.042), (24, 0.147), (25, -0.013), (26, -0.056), (27, -0.028), (28, -0.037), (29, -0.077), (30, 0.015), (31, -0.04), (32, 0.048), (33, -0.078), (34, -0.082), (35, -0.094), (36, -0.015), (37, -0.244), (38, -0.19), (39, -0.014), (40, 0.037), (41, 0.001), (42, 0.35), (43, 0.162), (44, 0.091), (45, -0.09), (46, 0.093), (47, 0.192), (48, -0.075), (49, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97840786 83 jmlr-2013-Orange: Data Mining Toolbox in Python

Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan

Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally intensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting

2 0.56889749 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten

Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization

3 0.39568871 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia

Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced Barnes-Hut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software

4 0.34163368 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

Author: Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray

Abstract: MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging modern features of C++. MLPACK provides cutting-edge algorithms whose benchmarks exhibit far better performance than other leading machine learning libraries. MLPACK version 1.0.3, licensed under the LGPL, is available at http://www.mlpack.org. Keywords: C++, dual-tree algorithms, machine learning software, open source software, large-scale learning

1. Introduction and Goals. Though several machine learning libraries are freely available online, few, if any, offer efficient algorithms to the average user. For instance, the popular Weka toolkit (Hall et al., 2009) emphasizes ease of use but scales poorly; the distributed Apache Mahout library offers scalability at a cost of higher overhead (such as clusters and powerful servers often unavailable to the average user). Also, few libraries offer breadth; for instance, libsvm (Chang and Lin, 2011) and the Tilburg Memory-Based Learner (TiMBL) are highly scalable and accessible yet each offer only a single method. MLPACK, intended to be the machine learning analog to the general-purpose LAPACK linear algebra library, aims to combine efficiency and accessibility. Written in C++, MLPACK uses the highly efficient Armadillo matrix library (Sanderson, 2010) and is freely available under the GNU Lesser General Public License (LGPL). Through the use of C++ templates, MLPACK both eliminates unnecessary copying of data sets and performs expression optimizations unavailable in other languages. Also, MLPACK is, to our knowledge, unique among existing libraries in using generic programming features of C++ to allow customization of the available machine learning methods without incurring performance penalties. In addition, users ranging from students to experts should find the consistent, intuitive interface of MLPACK to be highly accessible. Finally, the source code provides references and comprehensive documentation. Four major goals of the development team of MLPACK are:

• to implement scalable, fast machine learning algorithms,
• to design an intuitive, consistent, and simple API for non-expert users,
• to implement a variety of machine learning methods, and
• to provide cutting-edge machine learning algorithms unavailable elsewhere.

This paper offers both an introduction to the simple and extensible API and a glimpse of the superior performance of the library.

2. Package Overview. Each algorithm available in MLPACK features both a set of C++ library functions and a standalone command-line executable. Version 1.0.3 includes the following methods:

• nearest/furthest neighbor search with cover trees or kd-trees (k-nearest-neighbors)
• range search with cover trees or kd-trees
• Gaussian mixture models (GMMs)
• hidden Markov models (HMMs)
• LARS / Lasso regression
• k-means clustering
• fast hierarchical clustering (Euclidean MST calculation) [1] (March et al., 2010)
• kernel PCA (and regular PCA)
• local coordinate coding [1] (Yu et al., 2009)
• sparse coding using dictionary learning
• RADICAL (Robust, Accurate, Direct ICA aLgorithm) (Learned-Miller and Fisher, 2003)
• maximum variance unfolding (MVU) via LRSDP [1] (Burer and Monteiro, 2003)
• the naive Bayes classifier
• density estimation trees [1] (Ram and Gray, 2011)

[1] This algorithm is not available in any other comparable software package.

The development team manages MLPACK with Subversion and the Trac bug reporting system, allowing easy downloads and simple bug reporting. The entire development process is transparent, so any interested user can easily contribute to the library. MLPACK can compile from source on Linux, Mac OS, and Windows; currently, different Linux distributions are reviewing MLPACK for inclusion in their package managers, which will allow users to install MLPACK without needing to compile from source.

3. A Consistent, Simple API. MLPACK features a highly accessible API, both in style (such as consistent naming schemes and coding conventions) and ease of use (such as templated defaults), as well as stringent documentation standards. Consequently, a new user can execute algorithms out-of-the-box often with little or no adjustment to parameters, while the seasoned expert can expect extreme flexibility in algorithmic tuning. For example, the following line initializes an object which will perform the standard k-means clustering in Euclidean space: KMeans

Table 1: k-NN benchmarks (in seconds).

Data Set    MLPACK      Weka        Shogun      MATLAB      mlpy        sklearn
wine        0.0003      0.0621      0.0277      0.0021      0.0025      0.0008
cloud       0.0069      0.1174      0.5000      0.0210      0.3520      0.0192
wine-qual   0.0290      0.8868      4.3617      0.6465      4.0431      0.1668
isolet      13.0197     213.4735    37.6190     46.9518     52.0437     46.8016
miniboone   20.2045     216.1469    2351.4637   1088.1127   3219.2696   714.2385
yp-msd      5430.0478   >9000.0000  >9000.0000  >9000.0000  >9000.0000  >9000.0000
corel       4.9716      14.4264     555.9600    60.8496     209.5056    160.4597
covtype     14.3449     45.9912     >9000.0000  >9000.0000  >9000.0000  651.6259
mnist       2719.8087   >9000.0000  3536.4477   4838.6747   5192.3586   5363.9650
randu       1020.9142   2665.0921   >9000.0000  1679.2893   >9000.0000  8780.0176

Table 2: Benchmark data set sizes.

Data Set    UCI Name           Size
wine        Wine               178x13
cloud       Cloud              2048x10
wine-qual   Wine Quality       6497x11
isolet      ISOLET             7797x617
miniboone   MiniBooNE          130064x50
yp-msd      YearPredictionMSD  515345x90
corel       Corel              37749x32
covtype     Covertype          581082x54
mnist       N/A                70000x784
randu       N/A                1000000x10

5 0.33984041 95 jmlr-2013-Ranking Forests

Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis

Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Clémençon and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization

6 0.2675595 90 jmlr-2013-Quasi-Newton Method: A New Direction

7 0.23793454 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

8 0.15954709 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars

9 0.12942314 22 jmlr-2013-Classifying With Confidence From Incomplete Information

10 0.12879798 106 jmlr-2013-Stationary-Sparse Causality Network Learning

11 0.1261335 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

12 0.12159495 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

13 0.1142038 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

14 0.11017275 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

15 0.10871064 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules

16 0.10367414 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

17 0.10249405 96 jmlr-2013-Regularization-Free Principal Curve Estimation

18 0.10220445 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

19 0.099980041 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

20 0.097843513 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (5, 0.077), (10, 0.03), (20, 0.026), (23, 0.014), (41, 0.611), (44, 0.023), (75, 0.017), (87, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8862654 83 jmlr-2013-Orange: Data Mining Toolbox in Python

Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan

Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationally intensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting

2 0.79786062 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

Author: Joachim Niehren, Jérôme Champavère, Aurélien Lemay, Rémi Gilleron

Abstract: Inference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schema-guided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction. Keywords: XML information extraction, XML schemas, interactive learning, tree automata, grammatical inference

3 0.58075821 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

Author: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto

Abstract: We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2, Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation. In particular, a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties has a dual representation that can be exploited to design interior point (IP) methods. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing their optimality. We provide several numerical examples, along with a code that can be used to solve general PLQ problems. The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaussian

4 0.1773674 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

Author: Wei Wu, Zhengdong Lu, Hang Li

Abstract: The task of matching data from two heterogeneous domains naturally arises in various areas such as web search, collaborative filtering, and drug design. In web search, existing work has designed relevance models to match queries and documents by exploiting either user clicks or content of queries and documents. To the best of our knowledge, however, there has been little work on principled approaches to leveraging both clicks and content to learn a matching model for search. In this paper, we propose a framework for learning to match heterogeneous objects. The framework learns two linear mappings for two objects respectively, and matches them via the dot product of their images after mapping. Moreover, when different regularizations are enforced, the framework renders a rich family of matching models. With orthonormal constraints on mapping functions, the framework subsumes Partial Least Squares (PLS) as a special case. Alternatively, with a ℓ1 +ℓ2 regularization, we obtain a new model called Regularized Mapping to Latent Structures (RMLS). RMLS enjoys many advantages over PLS, including lower time complexity and easy parallelization. To further understand the matching framework, we conduct generalization analysis and apply the result to both PLS and RMLS. We apply the framework to web search and implement both PLS and RMLS using a click-through bipartite with metadata representing features of queries and documents. We test the efficacy and scalability of RMLS and PLS on large scale web search problems. The results show that both PLS and RMLS can significantly outperform baseline methods, while RMLS substantially speeds up the learning process. Keywords: web search, partial least squares, regularized mapping to latent structures, generalization analysis

5 0.16195486 73 jmlr-2013-Multicategory Large-Margin Unified Machines

Author: Chong Zhang, Yufeng Liu

Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Large-margin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine

6 0.15908641 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

7 0.15725705 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

8 0.14813644 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

9 0.14586858 78 jmlr-2013-On the Learnability of Shuffle Ideals

10 0.14520934 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

11 0.14412634 59 jmlr-2013-Large-scale SVD and Manifold Learning

12 0.14368108 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

13 0.1415216 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

14 0.14139265 15 jmlr-2013-Bayesian Canonical Correlation Analysis

15 0.1407772 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

16 0.14008898 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

17 0.13979472 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

18 0.13844058 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

19 0.13827026 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

20 0.13826536 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression