jmlr jmlr2008 knowledge-graph by maker-knowledge-mining
1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with particular attention to its Bayes risk consistency under some conditions. In this paper, we study basic statistical properties of the support vector machine that have yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur-type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results not only further our understanding of the support vector machine, but they can also be useful for related statistical inference. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
2 jmlr-2008-A Library for Locally Weighted Projection Regression
Author: Stefan Klanke, Sethu Vijayakumar, Stefan Schaal
Abstract: In this paper we introduce an improved implementation of locally weighted projection regression (LWPR), a supervised learning algorithm that is capable of handling high-dimensional input data. As the key features, our code supports multi-threading, is available for multiple platforms, and provides wrappers for several programming languages. Keywords: regression, local learning, online learning, C, C++, Matlab, Octave, Python
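The locally weighted regression idea underlying LWPR can be illustrated in a few lines of NumPy. The sketch below is purely conceptual: it fits a kernel-weighted linear model around a single query point and does not use the LWPR library's actual API, its incremental updates, or its local projection (partial least squares) machinery; the bandwidth and ridge term are arbitrary choices for illustration.

```python
import numpy as np

def locally_weighted_predict(X, y, x_query, bandwidth=0.5):
    """Fit a linear model weighted by a Gaussian kernel centred at x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * bandwidth ** 2))
    Xb = np.hstack([X, np.ones((len(X), 1))])                    # add a bias column
    A = Xb.T @ (w[:, None] * Xb) + 1e-8 * np.eye(Xb.shape[1])    # small ridge for stability
    beta = np.linalg.solve(A, Xb.T @ (w * y))
    return np.append(x_query, 1.0) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(locally_weighted_predict(X, y, np.array([1.0])))           # roughly sin(1) ≈ 0.84
```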
3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
Author: Zach Jorgensen, Yan Zhou, Meador Inge
Abstract: Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of “good” words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain. Keywords: spam filtering, multiple instance learning, good word attack, adversarial learning
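A minimal sketch of the bag-level decision rule described above. The segment length, the word list, and the per-instance scorer are toy stand-ins (the paper trains a multiple instance logistic regression model on the instances); only the "spam if at least one instance looks spammy" logic is the point here.

```python
def split_into_segments(tokens, seg_len=50):
    """Turn an email (a token list) into a bag of fixed-length segments (instances)."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

def classify_email(segments, instance_score, threshold=0.5):
    """Bag rule: spam if at least one instance scores above the threshold,
    legitimate only if every instance looks legitimate."""
    return "spam" if max(instance_score(s) for s in segments) > threshold else "legitimate"

# toy instance scorer: fraction of tokens from a (hypothetical) spam vocabulary
SPAM_WORDS = {"winner", "free", "credit", "prize"}
score = lambda seg: sum(t in SPAM_WORDS for t in seg) / max(len(seg), 1)

email = "dear winner you have won a free credit prize".split() + ["meeting"] * 60
print(classify_email(split_into_segments(email, seg_len=8), score, threshold=0.2))  # -> spam
```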
5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, …, β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,…,L} β̂_ℓᵀ(I − A)β̂_ℓ subject to A ∈ A_{m∗}, where A_{m∗} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m∗, with m∗ being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂_ℓ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
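The optimization problem above is a semidefinite program and can be prototyped directly, for example with cvxpy. In the sketch below the β̂ vectors are random stand-ins rather than the structure-adaptive estimates the paper actually uses, and the dimension, L, and m∗ are arbitrary illustrative values.

```python
import numpy as np
import cvxpy as cp

d, L, m_star = 6, 20, 2
rng = np.random.default_rng(0)
betas = [rng.standard_normal(d) for _ in range(L)]   # stand-ins for the estimated directions

A = cp.Variable((d, d), symmetric=True)
I = np.eye(d)
objective = cp.Minimize(cp.maximum(*[cp.quad_form(b, I - A) for b in betas]))
constraints = [A >> 0, I - A >> 0, cp.trace(A) <= m_star]  # eigenvalues in [0, 1], trace <= m*
cp.Problem(objective, constraints).solve()

# inspect the recovered A: symmetric, eigenvalues in [0, 1], trace at most m*
print(np.round(np.linalg.eigvalsh(A.value), 3))
```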
6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until no subset can be decomposed further. In our approach, the search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
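To make the recipe concrete, here is a small, self-contained conformal predictor for binary classification using a nearest-neighbour nonconformity measure of the kind discussed in the tutorial; the data, the labels, and ε below are illustrative only, and the function names are ours.

```python
import numpy as np

def nn_nonconformity(i, X, y):
    """Distance to the nearest example with the same label divided by the
    distance to the nearest example with a different label (1-NN measure)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return d[y == y[i]].min() / d[y != y[i]].min()

def conformal_prediction_set(X_train, y_train, x_new, labels, eps=0.05):
    """Include every label whose conformal p-value exceeds eps (coverage 1 - eps)."""
    region = []
    for lab in labels:
        X = np.vstack([X_train, x_new])
        y = np.append(y_train, lab)
        scores = np.array([nn_nonconformity(i, X, y) for i in range(len(y))])
        p_value = np.mean(scores >= scores[-1])   # rank of the new example's score
        if p_value > eps:
            region.append(lab)
    return region

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
print(conformal_prediction_set(X_train, y_train, np.array([2.8, 3.1]), labels=[0, 1]))
```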
8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
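The synapse-level cooperative coevolution idea can be caricatured on a toy problem. The sketch below evolves the nine weights of a tiny 2-2-1 network on XOR, keeping one sub-population per weight, evaluating networks assembled column-wise, and permuting each sub-population so weight values are tried with new partners. It is a loose illustration of the idea under our own selection and mutation choices, not the published CoSyNE algorithm, and it may need more generations to solve XOR cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([0.0, 1.0, 1.0, 0.0])

N_W, POP = 9, 40                                  # 9 weights of a 2-2-1 net, 40 candidates per weight
pops = rng.normal(0, 1, size=(N_W, POP))          # one sub-population per synapse

def forward(w, x):
    h = np.tanh(w[0:4].reshape(2, 2) @ x + w[4:6])
    return 1 / (1 + np.exp(-(w[6:8] @ h + w[8])))

def fitness(w):
    return -np.mean((np.array([forward(w, x) for x in X]) - T) ** 2)

for gen in range(800):
    fit = np.array([fitness(pops[:, j]) for j in range(POP)])   # column j = one assembled network
    pops = pops[:, np.argsort(-fit)]                            # best networks first
    k = POP // 4
    parents = pops[:, :k]
    pops[:, -k:] = parents[:, rng.integers(0, k, size=k)] \
                   + 0.3 * rng.standard_normal((N_W, k))        # mutate copies of the best quarter
    for i in range(N_W):                                        # permute each synapse sub-population
        rng.shuffle(pops[i])

best = pops[:, np.argmax([fitness(pops[:, j]) for j in range(POP)])]
print([round(float(forward(best, x)), 2) for x in X])           # ideally close to [0, 1, 1, 0]
```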
9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning proceeds by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
Author: Suhrid Balakrishnan, David Madigan
Abstract: Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multipass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function. Keywords: Laplace approximation, expectation propagation, LASSO
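As a rough illustration of the "quadratic approximation to the likelihood" idea, the sketch below makes a single pass over a data stream, replaces each example's logistic log-likelihood with a second-order Taylor expansion (here, crudely, around β = 0), folds it into running sufficient statistics, and then fits the L1-penalised quadratic surrogate by coordinate descent. The expansion point, the penalty value, and the update schedule are our simplifications; the paper's online and multipass algorithms are more refined.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def one_pass_quadratic_lasso_logistic(stream, dim, lam=20.0, sweeps=100):
    """Single pass: expand each example's logistic log-likelihood to second order
    and fold it into sufficient statistics (A, b); then minimise the L1-penalised
    quadratic surrogate by coordinate descent."""
    beta = np.zeros(dim)
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for x, y in stream:                          # y in {0, 1}
        p = 1.0 / (1.0 + np.exp(-x @ beta))
        w = max(p * (1 - p), 1e-4)               # curvature of the logistic loss
        z = x @ beta + (y - p) / w               # IRLS-style working response
        A += w * np.outer(x, x)
        b += w * z * x
    for _ in range(sweeps):                      # minimise 0.5 b'Ab - b'b + lam * |b|_1
        for j in range(dim):
            r = b[j] - A[j] @ beta + A[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / A[j, j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
true_beta = np.zeros(20); true_beta[:3] = [2.0, -2.0, 1.5]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(np.round(one_pass_quadratic_lasso_logistic(zip(X, y), dim=20), 2))
# the three informative features should carry the largest (shrunken) coefficients
```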
13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
Author: Salvador García, Francisco Herrera
Abstract: In a recently published paper in JMLR, Demšar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n × n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures. Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses
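For n × n comparisons of classifiers over multiple data sets, a common workflow is a Friedman test followed by all pairwise post-hoc tests with adjusted p-values. The sketch below uses synthetic accuracies and a Holm adjustment via statsmodels as a simple stand-in for the more powerful adjustment procedures the paper advocates.

```python
import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

# accuracies of 4 classifiers on 12 data sets (synthetic numbers, purely illustrative)
rng = np.random.default_rng(0)
acc = np.clip(rng.normal([0.80, 0.83, 0.78, 0.81], 0.03, size=(12, 4)), 0.0, 1.0)

# global n x n comparison: Friedman test with data sets as blocks
print(stats.friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])]))

# all pairwise Wilcoxon signed-rank tests, with Holm-adjusted (hence comparable) p-values
pairs = list(combinations(range(acc.shape[1]), 2))
raw_p = [stats.wilcoxon(acc[:, i], acc[:, j]).pvalue for i, j in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (i, j), p, a, r in zip(pairs, raw_p, adj_p, reject):
    print(f"clf{i} vs clf{j}: raw p = {p:.3f}, adjusted p = {a:.3f}, reject H0 = {r}")
```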
15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines
Author: Gerda Claeskens, Christophe Croux, Johan Van Kerckhoven
Abstract: Support vector machines for classification have the advantage that the curse of dimensionality is circumvented. It has been shown that a reduction of the dimension of the input space leads to even better results. For this purpose, we propose two information criteria which can be computed directly from the definition of the support vector machine. We assess the predictive performance of the models selected by our new criteria and compare them to existing variable selection techniques in a simulation study. The simulation results show that the new criteria are competitive in terms of generalization error rate while being much easier to compute. We arrive at the same findings for comparison on some real-world benchmark data sets. Keywords: information criterion, supervised classification, support vector machine, variable selection
16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplace's approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
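As a pointer to what one of the reviewed approximations does, here is a bare-bones Laplace approximation for GP binary classification: Newton iterations for the mode of the latent posterior under a logistic likelihood, followed by a squashed predictive mean. The RBF kernel, its length-scale, and the omission of the predictive-variance correction are simplifications for illustration; this is not the paper's unifying code.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def laplace_gp_classify(X, y, X_test, ls=1.0, iters=20):
    """Laplace approximation: Newton steps to the latent posterior mode for
    binary labels y in {-1, +1} with a logistic likelihood (bare-bones sketch)."""
    n = len(y)
    K = rbf_kernel(X, X, ls) + 1e-6 * np.eye(n)
    f = np.zeros(n)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-f))
        g = (y + 1) / 2 - pi                      # gradient of log p(y | f)
        W = pi * (1 - pi)                         # negative Hessian (diagonal)
        f = np.linalg.solve(np.eye(n) + K * W, K @ (W * f + g))
    g = (y + 1) / 2 - 1.0 / (1.0 + np.exp(-f))    # gradient at the mode
    mu = rbf_kernel(X_test, X, ls) @ g            # predictive latent mean k(x*, X) K^{-1} f_hat
    return 1.0 / (1.0 + np.exp(-mu))              # squashed mean (ignores latent variance)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.7, (20, 2)), rng.normal(1.5, 0.7, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(np.round(laplace_gp_classify(X, y, np.array([[-1.5, -1.5], [1.5, 1.5]])), 2))
# roughly [low, high]: probabilities of the +1 class for the two test points
```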
17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been shown to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
Author: Jiji Zhang
Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention
21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
22 jmlr-2008-Closed Sets for Labeled Data
24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
26 jmlr-2008-Consistency of Trace Norm Minimization
27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
47 jmlr-2008-Learning Balls of Strings from Edit Corrections
48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
52 jmlr-2008-Learning from Multiple Sources
53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
54 jmlr-2008-Learning to Select Features using their Properties
55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
56 jmlr-2008-Magic Moments for Structured Output Prediction
57 jmlr-2008-Manifold Learning: The Price of Normalization
58 jmlr-2008-Max-margin Classification of Data with Absent Features
59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
61 jmlr-2008-Mixed Membership Stochastic Blockmodels
68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations
72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
80 jmlr-2008-Ranking Individuals by Group Comparisons
81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
83 jmlr-2008-Robust Submodular Observation Selection
84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
87 jmlr-2008-Stationary Features and Cat Detection
88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
91 jmlr-2008-Trust Region Newton Method for Logistic Regression
92 jmlr-2008-Universal Multi-Task Kernels
93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
96 jmlr-2008-Visualizing Data using t-SNE