jmlr jmlr2012 jmlr2012-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
Reference: text
sentIndex sentText sentNum sentScore
1 BE CLiPS Computational Linguistics Group University of Antwerp 2000 Antwerp, Belgium Editor: Cheng Soon Ong Abstract Pattern is a package for Python 2. [sent-7, score-0.066]
2 It is well documented and bundled with 30+ examples and 350+ unit tests. [sent-9, score-0.115]
3 The source code is licensed under BSD and available from http://www. [sent-10, score-0.142]
4 Keywords: Python, data mining, natural language processing, machine learning, graph networks 1. [sent-15, score-0.117]
5 Introduction The World Wide Web is an immense collection of linguistic information that has in the last decade gathered attention as a valuable resource for tasks such as machine translation, opinion mining and trend detection, that is, “Web as Corpus” (Kilgarriff and Grefenstette, 2003). [sent-16, score-0.048]
6 This use of the WWW poses a challenge since the Web is interspersed with code (HTML markup) and lacks metadata (language identification, part-of-speech tags, semantic labels). [sent-17, score-0.121]
7 “Pattern” (BSD license) is a Python package for web mining, natural language processing, machine learning and network analysis, with a focus on ease-of-use. [sent-18, score-0.312]
8 It offers a mash-up of tools often used when harnessing the Web as a corpus, which usually requires several independent toolkits chained together in a practical application. [sent-19, score-0.183]
9 Several such toolkits with a user interface exist in the scientific community, for example ORANGE (Demšar et al. [sent-20, score-0.086]
10 By contrast, PATTERN is more related to toolkits such as NLTK (Bird et al. [sent-23, score-0.086]
11 The package aims to be useful to both a scientific and a non-scientific audience. [sent-28, score-0.066]
12 We believe that PATTERN is valuable as a learning environment for students, as a rapid development framework for web developers, and in research projects with a short development cycle. [sent-32, score-0.24]
13 Text is mined from the web and searched by syntax and semantics. [sent-35, score-0.374]
14 Package Overview PATTERN is organized in separate modules that can be chained together, as shown in Figure 1. [sent-38, score-0.101]
15 web Tools for web data mining, using a download mechanism that supports caching, proxies, asynchronous requests and redirection. [sent-45, score-0.332]
16 A SearchEngine class provides a uniform API to multiple web services: Google, Bing, Yahoo! [sent-46, score-0.166]
17 , Twitter, Wikipedia, Flickr and news feeds using FEED PARSER (packages. [sent-47, score-0.04]
18 The module includes an HTML parser based on BEAUTIFUL SOUP (crummy. [sent-50, score-0.407]
19 com/software/beautifulsoup), a PDF parser based on PDFMINER (unixuser. [sent-51, score-0.259]
20 en Fast, regular expressions-based shallow parser for English (identifies sentence constituents, e. [sent-54, score-0.472]
21 , nouns, verbs), using a finite state part-of-speech tagger (Brill, 1992) extended with a tokenizer, lemmatizer and chunker. [sent-56, score-0.075]
22 A parser with higher accuracy (MBSP) can be plugged in. [sent-58, score-0.259]
23 The module has a Sentence class for parse tree traversal, functions for singularization/pluralization (Conway, 1998), conjugation, modality and sentiment analysis. [sent-59, score-0.344]
24 It comes bundled with WORDNET 3 (Fellbaum, 1998) and PYWORDNET. [sent-60, score-0.115]
25 en for Dutch, using the BRILL-NL language model (Geertzen, 2010). [sent-63, score-0.08]
26 Contributors are encouraged to read the developer documentation on how to add support for other languages. [sent-64, score-0.143]
27 Documents are lemmatized bag-of-words that can be grouped in a sparse corpus to compute TF-IDF, distance metrics (cosine, Euclidean, Manhattan, Hamming) and dimension reduction (Latent Semantic Analysis). [sent-75, score-0.186]
28 The module includes a hierarchical and a k-means clustering algorithm, optimized with the k-means++ initialization algorithm (Arthur and Vassilvitskii, 2007) and triangle inequality (Elkan, 2003). [sent-76, score-0.148]
29 A Naive Bayes, a k-NN, and an SVM classifier using LIBSVM (Chang and Lin, 2011) are included, with tools for feature selection (information gain) and K-fold cross validation. [sent-77, score-0.033]
30 graph Graph data structure using Node, Edge and Graph classes, useful (for example) for modeling semantic networks. [sent-79, score-0.091]
31 The module has algorithms for shortest path finding, subgraph partitioning, eigenvector centrality and betweenness centrality (Brandes, 2001). [sent-80, score-0.468]
32 The module has a force-based layout algorithm that positions nodes in 2D space. [sent-82, score-0.148]
33 Evaluation metrics including a code profiler, functions for accuracy, precision and recall, confusion matrix, inter-rater agreement (Fleiss’ kappa), string similarity (Levenshtein, Dice) and readability (Flesch). [sent-87, score-0.114]
34 Example Script As an example, we chain together four PATTERN modules to train a k-NN classifier on adjectives mined from Twitter. [sent-91, score-0.224]
35 First, we mine 1,500 tweets with the hashtag #win or #fail (our classes), for example: “$20 tip off a sweet little old lady today #win”. [sent-92, score-0.075]
36 We parse the part-of-speech tags for each tweet, keeping adjectives. [sent-93, score-0.158]
37 We group the adjective vectors in a corpus and use it to train the classifier. [sent-94, score-0.335]
38 vector import import import import Twitter Sentence, parse search Document, Corpus, KNN corpus = Corpus() for i in range(1,15): for tweet in Twitter(). [sent-103, score-0.729]
39 lower() s = Sentence(parse(s)) s = search('JJ', s) # JJ = adjective s = [match[0]. [sent-108, score-0.149]
40 append(Document(s, type=p)) classifier = KNN() for document in corpus: classifier. [sent-111, score-0.095]
41 classify('stupid') # yields 'FAIL' Figure 2: Example source code for a k-NN classifier trained on Twitter messages. [sent-114, score-0.142]
42 Case Study As a case study, we used PATTERN to create a Dutch sentiment lexicon (De Smedt and Daelemans, 2012). [sent-116, score-0.239]
43 We mined online Dutch book reviews and extracted the 1,000 most frequent adjectives. [sent-117, score-0.147]
44 These were manually annotated with positivity, negativity, and subjectivity scores. [sent-118, score-0.178]
45 , 2007) we extracted the most frequent nouns and the adjectives preceding those nouns. [sent-121, score-0.222]
46 This results in a vector space with approximately 5,750 adjective vectors with nouns as features. [sent-122, score-0.261]
47 For each annotated adjective we then computed k-NN and inherited its scores to neighbor adjectives. [sent-123, score-0.215]
48 Documentation PATTERN comes bundled with examples and unit tests. [sent-127, score-0.115]
49 The documentation contains a quick overview, installation instructions, and for each module a detailed page with the API reference, examples of use and a discussion of the scientific principles. [sent-128, score-0.234]
50 The documentation assumes no prior knowledge, except for a background in Python programming. [sent-129, score-0.086]
51 Source Code is written in pure Python, meaning that we sacrifice performance for development speed and readability (i. [sent-133, score-0.084]
52 The package runs on all platforms and has no dependencies, with the exception of NumPy when LSA is used. [sent-136, score-0.066]
53 The source code is released under a BSD license, so it can be incorporated into proprietary products or used in combination with other open source packages such as SCRAPY (web mining), NLTK (natural language processing), PYBRAIN and PYML (machine learning) and NETWORKX (network analysis). [sent-141, score-0.297]
54 , 2010), a robust, memory-based shallow parser built on the TIMBL machine learning software. [sent-143, score-0.355]
55 The APIs for the PATTERN parser and MBSP are identical. [sent-144, score-0.259]
56 Gephi: An open source software for exploring and manipulating networks. [sent-150, score-0.075]
57 Jeroen Geertzen :: software & demos : Brill-nl, June 2010. [sent-184, score-0.075]
58 Introduction to the special issue on the web as corpus. [sent-191, score-0.166]
59 A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. [sent-197, score-0.112]
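The example script summarized above trains a k-NN classifier on adjectives taken from tweets tagged #win or #fail. The sketch below illustrates that idea only; it does not use the PATTERN API. It relies solely on the Python standard library, the names cosine and TinyKNN are hypothetical, and the six training examples are invented toy data rather than mined tweets.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words dicts (term -> count)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyKNN:
    """Hypothetical minimal k-NN classifier; not the PATTERN KNN class."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (bag-of-words, label) pairs

    def train(self, words, label):
        self.examples.append((Counter(words), label))

    def classify(self, words):
        query = Counter(words)
        # Rank stored examples from most to least similar to the query.
        ranked = sorted(self.examples,
                        key=lambda ex: cosine(query, ex[0]),
                        reverse=True)
        # Majority vote among the k nearest examples.
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0] if votes else None


# Toy adjective lists standing in for mined tweets (invented for illustration).
knn = TinyKNN(k=3)
knn.train(["sweet", "little", "old"], "WIN")
knn.train(["great", "sweet"], "WIN")
knn.train(["awesome", "happy"], "WIN")
knn.train(["stupid", "slow"], "FAIL")
knn.train(["stupid", "broken"], "FAIL")
knn.train(["terrible", "late"], "FAIL")

print(knn.classify(["sweet"]))
print(knn.classify(["stupid"]))
```

In this sketch, neighbors with zero similarity can still vote when fewer than k examples share a word with the query. A fuller implementation would probably weight votes by similarity or apply TF-IDF weighting first, as the description of the vector module suggests.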
wordName wordTfidf (topN-words)
[('parser', 0.259), ('smedt', 0.224), ('python', 0.187), ('daelemans', 0.187), ('dutch', 0.187), ('twitter', 0.187), ('corpus', 0.186), ('web', 0.166), ('walter', 0.16), ('adjective', 0.149), ('module', 0.148), ('tom', 0.133), ('centrality', 0.128), ('sentiment', 0.124), ('win', 0.124), ('sentence', 0.117), ('lexicon', 0.115), ('bundled', 0.115), ('brill', 0.112), ('mbsp', 0.112), ('mined', 0.112), ('nouns', 0.112), ('subjectivity', 0.112), ('wikipedia', 0.112), ('html', 0.106), ('import', 0.099), ('bsd', 0.096), ('syntax', 0.096), ('shallow', 0.096), ('wordnet', 0.096), ('documentation', 0.086), ('tags', 0.086), ('toolkits', 0.086), ('language', 0.08), ('source', 0.075), ('adjectives', 0.075), ('antwerp', 0.075), ('bastian', 0.075), ('geertzen', 0.075), ('gephi', 0.075), ('hagberg', 0.075), ('jeroen', 0.075), ('kilgarriff', 0.075), ('mathieu', 0.075), ('medt', 0.075), ('networkx', 0.075), ('nltk', 0.075), ('ordelman', 0.075), ('sweet', 0.075), ('tagger', 0.075), ('tweet', 0.075), ('twnc', 0.075), ('pattern', 0.073), ('parse', 0.072), ('code', 0.067), ('annotated', 0.066), ('api', 0.066), ('package', 0.066), ('beautiful', 0.064), ('betweenness', 0.064), ('ua', 0.064), ('pybrain', 0.064), ('schaul', 0.064), ('print', 0.064), ('ython', 0.064), ('chained', 0.064), ('document', 0.063), ('bird', 0.057), ('english', 0.057), ('developer', 0.057), ('clips', 0.057), ('pang', 0.057), ('semantic', 0.054), ('dem', 0.05), ('pdf', 0.05), ('mining', 0.048), ('readability', 0.047), ('knn', 0.047), ('orange', 0.042), ('news', 0.04), ('license', 0.04), ('development', 0.037), ('modules', 0.037), ('graph', 0.037), ('arthur', 0.036), ('naive', 0.036), ('google', 0.036), ('linguistics', 0.035), ('frequent', 0.035), ('chang', 0.033), ('libsvm', 0.033), ('tools', 0.033), ('ar', 0.032), ('asch', 0.032), ('parsed', 0.032), ('terribly', 0.032), ('vassilvitskii', 0.032), ('classifier', 0.032), ('contributors', 0.032), ('reilly', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
2 0.065119788 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
Author: Félix-Antoine Fortin, François-Michel De Rainville, Marc-André Gardner, Marc Parizeau, Christian Gagné
Abstract: DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Its design departs from most other existing frameworks in that it seeks to make algorithms explicit and data structures transparent, as opposed to the more common black-box frameworks. Freely available with extensive documentation at http://deap.gel.ulaval.ca, DEAP is an open source project under an LGPL license. Keywords: distributed evolutionary algorithms, software tools
3 0.061070994 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
Author: Stephen Gould
Abstract: We present an open-source platform-independent C++ framework for machine learning and computer vision research. The framework includes a wide range of standard machine learning and graphical models algorithms as well as reference implementations for many machine learning and computer vision applications. The framework contains Matlab wrappers for core components of the library and an experimental graphical user interface for developing and visualizing machine learning data flows. Keywords: machine learning, graphical models, computer vision, open-source software
4 0.057863012 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
Author: Marinka Žitnik, Blaž Zupan
Abstract: NIMFA is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. It supports both dense and sparse matrix representation. NIMFA’s component-based implementation and hierarchical design should help the users to employ already implemented techniques or design and code new strategies for matrix factorization tasks. Keywords: nonnegative matrix factorization, initialization methods, quality measures, scripting, Python
5 0.056199707 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
6 0.045538414 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
7 0.04547051 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
8 0.043376967 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
9 0.043310747 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing
10 0.038679823 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
11 0.036372881 88 jmlr-2012-PREA: Personalized Recommendation Algorithms Toolkit
12 0.03484026 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
13 0.032323606 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R
14 0.02988068 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
15 0.027084816 106 jmlr-2012-Sign Language Recognition using Sub-Units
16 0.026193706 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
17 0.023456959 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
18 0.022402655 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
19 0.021315718 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
20 0.019793747 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
topicId topicWeight
[(0, -0.088), (1, 0.024), (2, 0.176), (3, -0.033), (4, 0.01), (5, 0.059), (6, 0.145), (7, -0.018), (8, -0.026), (9, -0.052), (10, -0.028), (11, -0.048), (12, 0.155), (13, -0.145), (14, -0.086), (15, 0.017), (16, 0.026), (17, -0.104), (18, -0.149), (19, -0.034), (20, -0.036), (21, 0.042), (22, -0.085), (23, -0.017), (24, -0.02), (25, -0.222), (26, 0.067), (27, -0.042), (28, -0.018), (29, 0.053), (30, -0.085), (31, 0.011), (32, -0.026), (33, 0.019), (34, -0.029), (35, -0.186), (36, -0.009), (37, -0.027), (38, 0.134), (39, 0.17), (40, 0.096), (41, -0.064), (42, 0.124), (43, 0.28), (44, -0.03), (45, -0.069), (46, 0.216), (47, -0.061), (48, -0.048), (49, 0.159)]
simIndex simValue paperId paperTitle
same-paper 1 0.97624326 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
2 0.50487399 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
Author: Félix-Antoine Fortin, François-Michel De Rainville, Marc-André Gardner, Marc Parizeau, Christian Gagné
Abstract: DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Its design departs from most other existing frameworks in that it seeks to make algorithms explicit and data structures transparent, as opposed to the more common black-box frameworks. Freely available with extensive documentation at http://deap.gel.ulaval.ca, DEAP is an open source project under an LGPL license. Keywords: distributed evolutionary algorithms, software tools
3 0.43649286 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
Author: Stephen Gould
Abstract: We present an open-source platform-independent C++ framework for machine learning and computer vision research. The framework includes a wide range of standard machine learning and graphical models algorithms as well as reference implementations for many machine learning and computer vision applications. The framework contains Matlab wrappers for core components of the library and an experimental graphical user interface for developing and visualizing machine learning data flows. Keywords: machine learning, graphical models, computer vision, open-source software
4 0.42915043 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing
Author: David Verstraeten, Benjamin Schrauwen, Sander Dieleman, Philemon Brakel, Pieter Buteneers, Dejan Pecevski
Abstract: Oger (OrGanic Environment for Reservoir computing) is a Python toolbox for building, training and evaluating modular learning architectures on large data sets. It builds on MDP for its modularity, and adds processing of sequential data sets, gradient descent training, several cross-validation schemes and parallel parameter optimization methods. Additionally, several learning algorithms are implemented, such as different reservoir implementations (both sigmoid and spiking), ridge regression, conditional restricted Boltzmann machine (CRBM) and others, including GPU accelerated versions. Oger is released under the GNU LGPL, and is available from http://organic.elis.ugent.be/oger. Keywords: Python, modular architectures, sequential processing
5 0.38299778 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
6 0.28488445 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
7 0.27874371 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
8 0.27789208 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
9 0.2370902 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks
10 0.22679386 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
11 0.21017885 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces
12 0.1786876 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
13 0.17619944 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
14 0.16634397 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
15 0.1458372 106 jmlr-2012-Sign Language Recognition using Sub-Units
16 0.13946249 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
18 0.12929888 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
19 0.1236542 113 jmlr-2012-The huge Package for High-dimensional Undirected Graph Estimation in R
20 0.11957286 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
topicId topicWeight
[(21, 0.012), (26, 0.02), (27, 0.684), (29, 0.019), (35, 0.011), (49, 0.015), (56, 0.041), (57, 0.011), (69, 0.015), (75, 0.012), (77, 0.018), (92, 0.019), (96, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.95144826 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
2 0.78899145 3 jmlr-2012-A Geometric Approach to Sample Compression
Author: Benjamin I.P. Rubinstein, J. Hyam Rubinstein
Abstract: The Sample Compression Conjecture of Littlestone & Warmuth has remained unsolved for a quarter century. While maximum classes (concept classes meeting Sauer’s Lemma with equality) can be compressed, the compression of general concept classes reduces to compressing maximal classes (classes that cannot be expanded without increasing VC dimension). Two promising ways forward are: embedding maximal classes into maximum classes with at most a polynomial increase to VC dimension, and compression via operating on geometric representations. This paper presents positive results on the latter approach and a first negative result on the former, through a systematic investigation of finite maximum classes. Simple arrangements of hyperplanes in hyperbolic space are shown to represent maximum classes, generalizing the corresponding Euclidean result. We show that sweeping a generic hyperplane across such arrangements forms an unlabeled compression scheme of size VC dimension and corresponds to a special case of peeling the one-inclusion graph, resolving a recent conjecture of Kuzmin & Warmuth. A bijection between finite maximum classes and certain arrangements of piecewise-linear (PL) hyperplanes in either a ball or Euclidean space is established. Finally we show that d-maximum classes corresponding to PL-hyperplane arrangements in Rd have cubical complexes homeomorphic to a d-ball, or equivalently complexes that are manifolds with boundary. A main result is that PL arrangements can be swept by a moving hyperplane to unlabeled d-compress any finite maximum class, forming a peeling scheme as conjectured by Kuzmin & Warmuth. A corollary is that some d-maximal classes cannot be embedded into any maximum class of VC-dimension d + k, for any constant k. The construction of the PL sweeping involves Pachner moves on the one-inclusion graph, corresponding to moves of a hyperplane across the intersection of d other hyperplanes. 
This extends the well known Pachner moves for triangulations to c
3 0.54833895 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
Author: Trinh Minh Tri Do, Thierry Artières
Abstract: Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large number of modern machine learning problems such as logistic regression, conditional random fields, large margin estimation, etc. We propose a novel algorithm for minimizing a regularized objective that is able to handle convex and non-convex, smooth and non-smooth risks. The algorithm is based on the cutting plane technique and on the idea of exploiting the regularization term in the objective function. It may be thought as a limited memory extension of convex regularized bundle methods for dealing with convex and non convex risks. In case the risk is convex the algorithm is proved to converge to a stationary solution with accuracy ε with a rate O(1/λε) where λ is the regularization parameter of the objective function under the assumption of a Lipschitz empirical risk. In case the risk is not convex getting such a proof is more difficult and requires a stronger and more disputable assumption. Yet we provide experimental results on artificial test problems, and on five standard and difficult machine learning problems that are cast as convex and non-convex optimization problems that show how our algorithm compares well in practice with state of the art optimization algorithms. Keywords: optimization, non-convex, non-smooth, cutting plane, bundle method, regularized risk
4 0.20652016 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
Author: Stephen Gould
Abstract: We present an open-source platform-independent C++ framework for machine learning and computer vision research. The framework includes a wide range of standard machine learning and graphical models algorithms as well as reference implementations for many machine learning and computer vision applications. The framework contains Matlab wrappers for core components of the library and an experimental graphical user interface for developing and visualizing machine learning data flows. Keywords: machine learning, graphical models, computer vision, open-source software
5 0.18807934 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
Author: Sunita Nayak, Kester Duncan, Sudeep Sarkar, Barbara Loeding
Abstract: We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a trajectory in a low dimensional space called the space of relational distributions. Given these time series trajectories, we extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). We show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that our approach is robust to some extent to the variations produced within a sign due to different contexts. We also show results whereby these learned sign models are used for spotting signs in test sequences. Keywords: pattern extraction, sign language recognition, signeme extraction, sign modeling, iterated conditional modes
6 0.17933583 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
7 0.16337094 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
8 0.16323012 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing
9 0.16305985 106 jmlr-2012-Sign Language Recognition using Sub-Units
10 0.15412073 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
11 0.14425299 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
12 0.14405674 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
13 0.14204963 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features
14 0.13526312 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
15 0.13362026 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
16 0.13356625 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
17 0.12661016 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
18 0.12627102 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
19 0.12624663 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
20 0.12574027 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers