jmlr jmlr2008 knowledge-graph by maker-knowledge-mining
1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with particular attention to its Bayes risk consistency under some conditions. In this paper, we study basic statistical properties of the support vector machine that have yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur-type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results not only further our understanding of the support vector machine, but they can also be useful for related statistical inference. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
2 jmlr-2008-A Library for Locally Weighted Projection Regression
Author: Stefan Klanke, Sethu Vijayakumar, Stefan Schaal
Abstract: In this paper we introduce an improved implementation of locally weighted projection regression (LWPR), a supervised learning algorithm that is capable of handling high-dimensional input data. As the key features, our code supports multi-threading, is available for multiple platforms, and provides wrappers for several programming languages. Keywords: regression, local learning, online learning, C, C++, Matlab, Octave, Python
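The locally weighted regression idea underlying LWPR can be illustrated in a few lines of NumPy. The sketch below is purely conceptual: it fits a kernel-weighted linear model around a single query point and does not use the LWPR library's actual API, its incremental updates, or its local projection (partial least squares) machinery; the bandwidth and ridge term are arbitrary choices for illustration.

```python
import numpy as np

def locally_weighted_predict(X, y, x_query, bandwidth=0.5):
    """Fit a linear model weighted by a Gaussian kernel centred at x_query."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * bandwidth ** 2))
    Xb = np.hstack([X, np.ones((len(X), 1))])                    # add a bias column
    A = Xb.T @ (w[:, None] * Xb) + 1e-8 * np.eye(Xb.shape[1])    # small ridge for stability
    beta = np.linalg.solve(A, Xb.T @ (w * y))
    return np.append(x_query, 1.0) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(locally_weighted_predict(X, y, np.array([1.0])))           # roughly sin(1) ≈ 0.84
```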
3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
Author: Zach Jorgensen, Yan Zhou, Meador Inge
Abstract: Statistical spam filters are known to be vulnerable to adversarial attacks. One of the more common adversarial attacks, known as the good word attack, thwarts spam filters by appending to spam messages sets of “good” words, which are words that are common in legitimate email but rare in spam. We present a counterattack strategy that attempts to differentiate spam from legitimate email in the input space by transforming each email into a bag of multiple segments, and subsequently applying multiple instance logistic regression on the bags. We treat each segment in the bag as an instance. An email is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a classifier using our multiple instance counterattack strategy is more robust to good word attacks than its single instance counterpart and other single instance learners commonly used in the spam filtering domain. Keywords: spam filtering, multiple instance learning, good word attack, adversarial learning
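A minimal sketch of the bag-level decision rule described above. The segment length, the word list, and the per-instance scorer are toy stand-ins (the paper trains a multiple instance logistic regression model on the instances); only the "spam if at least one instance looks spammy" logic is the point here.

```python
def split_into_segments(tokens, seg_len=50):
    """Turn an email (a token list) into a bag of fixed-length segments (instances)."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

def classify_email(segments, instance_score, threshold=0.5):
    """Bag rule: spam if at least one instance scores above the threshold,
    legitimate only if every instance looks legitimate."""
    return "spam" if max(instance_score(s) for s in segments) > threshold else "legitimate"

# toy instance scorer: fraction of tokens from a (hypothetical) spam vocabulary
SPAM_WORDS = {"winner", "free", "credit", "prize"}
score = lambda seg: sum(t in SPAM_WORDS for t in seg) / max(len(seg), 1)

email = "dear winner you have won a free credit prize".split() + ["meeting"] * 60
print(classify_email(split_into_segments(email, seg_len=8), score, threshold=0.2))  # -> spam
```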
5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say β̂_1, …, β̂_L, nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem: minimize max_{ℓ=1,…,L} β̂_ℓᵀ(I − A)β̂_ℓ subject to A ∈ A_{m∗}, where A_{m∗} is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal to m∗, with m∗ being the true structural dimension. Under mild assumptions, √n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors β̂_ℓ for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
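The optimization problem above is a semidefinite program and can be prototyped directly, for example with cvxpy. In the sketch below the β̂ vectors are random stand-ins rather than the structure-adaptive estimates the paper actually uses, and the dimension, L, and m∗ are arbitrary illustrative values.

```python
import numpy as np
import cvxpy as cp

d, L, m_star = 6, 20, 2
rng = np.random.default_rng(0)
betas = [rng.standard_normal(d) for _ in range(L)]   # stand-ins for the estimated directions

A = cp.Variable((d, d), symmetric=True)
I = np.eye(d)
objective = cp.Minimize(cp.maximum(*[cp.quad_form(b, I - A) for b in betas]))
constraints = [A >> 0, I - A >> 0, cp.trace(A) <= m_star]  # eigenvalues in [0, 1], trace <= m*
cp.Problem(objective, constraints).solve()

# inspect the recovered A: symmetric, eigenvalues in [0, 1], trace at most m*
print(np.round(np.linalg.eigvalsh(A.value), 3))
```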
6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until no subset can be decomposed further. In our approach, the search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
7 jmlr-2008-A Tutorial on Conformal Prediction
Author: Glenn Shafer, Vladimir Vovk
Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions
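To make the recipe concrete, here is a small, self-contained conformal predictor for binary classification using a nearest-neighbour nonconformity measure of the kind discussed in the tutorial; the data, the labels, and ε below are illustrative only, and the function names are ours.

```python
import numpy as np

def nn_nonconformity(i, X, y):
    """Distance to the nearest example with the same label divided by the
    distance to the nearest example with a different label (1-NN measure)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return d[y == y[i]].min() / d[y != y[i]].min()

def conformal_prediction_set(X_train, y_train, x_new, labels, eps=0.05):
    """Include every label whose conformal p-value exceeds eps (coverage 1 - eps)."""
    region = []
    for lab in labels:
        X = np.vstack([X_train, x_new])
        y = np.append(y_train, lab)
        scores = np.array([nn_nonconformity(i, X, y) for i in range(len(y))])
        p_value = np.mean(scores >= scores[-1])   # rank of the new example's score
        if p_value > eps:
            region.append(lab)
    return region

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
print(conformal_prediction_set(X_train, y_train, np.array([2.8, 3.1]), labels=[0, 1]))
```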
8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
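The synapse-level cooperative coevolution idea can be caricatured on a toy problem. The sketch below evolves the nine weights of a tiny 2-2-1 network on XOR, keeping one sub-population per weight, evaluating networks assembled column-wise, and permuting each sub-population so weight values are tried with new partners. It is a loose illustration of the idea under our own selection and mutation choices, not the published CoSyNE algorithm, and it may need more generations to solve XOR cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([0.0, 1.0, 1.0, 0.0])

N_W, POP = 9, 40                                  # 9 weights of a 2-2-1 net, 40 candidates per weight
pops = rng.normal(0, 1, size=(N_W, POP))          # one sub-population per synapse

def forward(w, x):
    h = np.tanh(w[0:4].reshape(2, 2) @ x + w[4:6])
    return 1 / (1 + np.exp(-(w[6:8] @ h + w[8])))

def fitness(w):
    return -np.mean((np.array([forward(w, x) for x in X]) - T) ** 2)

for gen in range(800):
    fit = np.array([fitness(pops[:, j]) for j in range(POP)])   # column j = one assembled network
    pops = pops[:, np.argsort(-fit)]                            # best networks first
    k = POP // 4
    parents = pops[:, :k]
    pops[:, -k:] = parents[:, rng.integers(0, k, size=k)] \
                   + 0.3 * rng.standard_normal((N_W, k))        # mutate copies of the best quarter
    for i in range(N_W):                                        # permute each synapse sub-population
        rng.shuffle(pops[i])

best = pops[:, np.argmax([fitness(pops[:, j]) for j in range(POP)])]
print([round(float(forward(best, x)), 2) for x in X])           # ideally close to [0, 1, 1, 0]
```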
9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning proceeds by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
Author: Sébastien Loustau
Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers
12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
Author: Suhrid Balakrishnan, David Madigan
Abstract: Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multipass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive data sets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function. Keywords: Laplace approximation, expectation propagation, LASSO
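As a rough illustration of the "quadratic approximation to the likelihood" idea, the sketch below makes a single pass over a data stream, replaces each example's logistic log-likelihood with a second-order Taylor expansion (here, crudely, around β = 0), folds it into running sufficient statistics, and then fits the L1-penalised quadratic surrogate by coordinate descent. The expansion point, the penalty value, and the update schedule are our simplifications; the paper's online and multipass algorithms are more refined.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def one_pass_quadratic_lasso_logistic(stream, dim, lam=20.0, sweeps=100):
    """Single pass: expand each example's logistic log-likelihood to second order
    and fold it into sufficient statistics (A, b); then minimise the L1-penalised
    quadratic surrogate by coordinate descent."""
    beta = np.zeros(dim)
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for x, y in stream:                          # y in {0, 1}
        p = 1.0 / (1.0 + np.exp(-x @ beta))
        w = max(p * (1 - p), 1e-4)               # curvature of the logistic loss
        z = x @ beta + (y - p) / w               # IRLS-style working response
        A += w * np.outer(x, x)
        b += w * z * x
    for _ in range(sweeps):                      # minimise 0.5 b'Ab - b'b + lam * |b|_1
        for j in range(dim):
            r = b[j] - A[j] @ beta + A[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / A[j, j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
true_beta = np.zeros(20); true_beta[:3] = [2.0, -2.0, 1.5]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(np.round(one_pass_quadratic_lasso_logistic(zip(X, y), dim=20), 2))
# the three informative features should carry the largest (shrunken) coefficients
```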
13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
Author: Eric Bax, Augusto Callejas
Abstract: This paper introduces a new PAC transductive error bound for classification. The method uses information from the training examples and inputs of working examples to develop a set of likely assignments to outputs of the working examples. A likely assignment with maximum error determines the bound. The method is very effective for small data sets. Keywords: error bound, transduction, nearest neighbor, dynamic programming
14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
Author: Salvador García, Francisco Herrera
Abstract: In a recently published paper in JMLR, Demšar (2006) recommends a set of non-parametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n × n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable p-values in multiple comparison procedures. Keywords: statistical methods, non-parametric test, multiple comparisons tests, adjusted p-values, logically related hypotheses
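For n × n comparisons of classifiers over multiple data sets, a common workflow is a Friedman test followed by all pairwise post-hoc tests with adjusted p-values. The sketch below uses synthetic accuracies and a Holm adjustment via statsmodels as a simple stand-in for the more powerful adjustment procedures the paper advocates.

```python
import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

# accuracies of 4 classifiers on 12 data sets (synthetic numbers, purely illustrative)
rng = np.random.default_rng(0)
acc = np.clip(rng.normal([0.80, 0.83, 0.78, 0.81], 0.03, size=(12, 4)), 0.0, 1.0)

# global n x n comparison: Friedman test with data sets as blocks
print(stats.friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])]))

# all pairwise Wilcoxon signed-rank tests, with Holm-adjusted (hence comparable) p-values
pairs = list(combinations(range(acc.shape[1]), 2))
raw_p = [stats.wilcoxon(acc[:, i], acc[:, j]).pvalue for i, j in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (i, j), p, a, r in zip(pairs, raw_p, adj_p, reject):
    print(f"clf{i} vs clf{j}: raw p = {p:.3f}, adjusted p = {a:.3f}, reject H0 = {r}")
```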
15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines
Author: Gerda Claeskens, Christophe Croux, Johan Van Kerckhoven
Abstract: Support vector machines for classification have the advantage that the curse of dimensionality is circumvented. It has been shown that a reduction of the dimension of the input space leads to even better results. For this purpose, we propose two information criteria which can be computed directly from the definition of the support vector machine. We assess the predictive performance of the models selected by our new criteria and compare them to existing variable selection techniques in a simulation study. The simulation results show that the new criteria are competitive in terms of generalization error rate while being much easier to compute. We arrive at the same findings for comparison on some real-world benchmark data sets. Keywords: information criterion, supervised classification, support vector machine, variable selection
16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplace's approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
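As a pointer to what one of the reviewed approximations does, here is a bare-bones Laplace approximation for GP binary classification: Newton iterations for the mode of the latent posterior under a logistic likelihood, followed by a squashed predictive mean. The RBF kernel, its length-scale, and the omission of the predictive-variance correction are simplifications for illustration; this is not the paper's unifying code.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def laplace_gp_classify(X, y, X_test, ls=1.0, iters=20):
    """Laplace approximation: Newton steps to the latent posterior mode for
    binary labels y in {-1, +1} with a logistic likelihood (bare-bones sketch)."""
    n = len(y)
    K = rbf_kernel(X, X, ls) + 1e-6 * np.eye(n)
    f = np.zeros(n)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-f))
        g = (y + 1) / 2 - pi                      # gradient of log p(y | f)
        W = pi * (1 - pi)                         # negative Hessian (diagonal)
        f = np.linalg.solve(np.eye(n) + K * W, K @ (W * f + g))
    g = (y + 1) / 2 - 1.0 / (1.0 + np.exp(-f))    # gradient at the mode
    mu = rbf_kernel(X_test, X, ls) @ g            # predictive latent mean k(x*, X) K^{-1} f_hat
    return 1.0 / (1.0 + np.exp(-mu))              # squashed mean (ignores latent variance)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.7, (20, 2)), rng.normal(1.5, 0.7, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(np.round(laplace_gp_classify(X, y, np.array([[-1.5, -1.5], [1.5, 1.5]])), 2))
# roughly [low, high]: probabilities of the +1 class for the two test points
```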
17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been shown to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
Author: Jiji Zhang
Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention
21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
22 jmlr-2008-Closed Sets for Labeled Data
24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
26 jmlr-2008-Consistency of Trace Norm Minimization
27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
47 jmlr-2008-Learning Balls of Strings from Edit Corrections
48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
52 jmlr-2008-Learning from Multiple Sources
53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
54 jmlr-2008-Learning to Select Features using their Properties
55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
56 jmlr-2008-Magic Moments for Structured Output Prediction
57 jmlr-2008-Manifold Learning: The Price of Normalization
58 jmlr-2008-Max-margin Classification of Data with Absent Features
59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
61 jmlr-2008-Mixed Membership Stochastic Blockmodels
68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations
72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
80 jmlr-2008-Ranking Individuals by Group Comparisons
81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
82 jmlr-2008-Responses to Evidence Contrary to the Statistical View of Boosting
83 jmlr-2008-Robust Submodular Observation Selection
84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
85 jmlr-2008-Shark (Machine Learning Open Source Software Paper)
87 jmlr-2008-Stationary Features and Cat Detection
88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
91 jmlr-2008-Trust Region Newton Method for Logistic Regression
92 jmlr-2008-Universal Multi-Task Kernels
93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
96 jmlr-2008-Visualizing Data using t-SNE