jmlr jmlr2006 knowledge-graph by maker-knowledge-mining
1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact, we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
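A minimal sketch of the general idea only, not the paper's algorithm: the expansion size is fixed in advance and both the expansion vectors and their coefficients are optimized for a margin-style objective, here with an RBF kernel, hinge loss, and a generic derivative-free optimizer standing in for the paper's dedicated gradient-based scheme. All function names and hyperparameters below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    # Pairwise RBF kernel values between rows of A and rows of B.
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def sparse_kernel_classifier(X, y, n_z=3, lam=1e-2, gamma=1.0, seed=0):
    """y in {-1, +1}; returns expansion vectors Z, coefficients beta, and a predictor."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    z0 = X[rng.choice(len(X), n_z, replace=False)].ravel()   # initialize expansion points from data
    theta0 = np.concatenate([z0, np.zeros(n_z)])              # expansion points + coefficients

    def objective(theta):
        Z = theta[:n_z * d].reshape(n_z, d)
        beta = theta[n_z * d:]
        f = rbf(X, Z, gamma) @ beta                           # decision values
        return np.maximum(0.0, 1.0 - y * f).mean() + lam * beta @ beta

    res = minimize(objective, theta0, method="Powell")        # derivative-free, for brevity
    Z, beta = res.x[:n_z * d].reshape(n_z, d), res.x[n_z * d:]
    return Z, beta, lambda Xnew: np.sign(rbf(Xnew, Z, gamma) @ beta)

# Toy usage: two Gaussian blobs, a 3-vector expansion suffices.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
Z, beta, predict = sparse_kernel_classifier(X, y)
print("training accuracy:", (predict(X) == y).mean())
```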
2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs
Author: Alberto Roverato, Milan Studený
Abstract: This paper deals with chain graph models under the alternative AMP interpretation. A new representative of an AMP Markov equivalence class, called the largest deflagged graph, is proposed. The representative is based on the revealed internal structure of the AMP Markov equivalence class. More specifically, the AMP Markov equivalence class decomposes into finer strong equivalence classes, and there exists a distinguished strong equivalence class among those forming the AMP Markov equivalence class. The largest deflagged graph is the largest chain graph in that distinguished strong equivalence class. A composed graphical procedure for obtaining the largest deflagged graph from any AMP Markov equivalent chain graph is presented. In general, the largest deflagged graph differs from the AMP essential graph, which is another representative of the AMP Markov equivalence class. Keywords: chain graph, AMP Markov equivalence, strong equivalence, largest deflagged graph, component merging procedure, deflagging procedure, essential graph
3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection
Author: Hichem Sahbi, Donald Geman
Abstract: We introduce a computational design for pattern detection based on a tree-structured network of support vector machines (SVMs). An SVM is associated with each cell in a recursive partitioning of the space of patterns (hypotheses) into increasingly finer subsets. The hierarchy is traversed coarse-to-fine and each chain of positive responses from the root to a leaf constitutes a detection. Our objective is to design and build a network which balances overall error and computation. Initially, SVMs are constructed for each cell with no constraints. This “free network” is then perturbed, cell by cell, into another network, which is “graded” in two ways: first, the number of support vectors of each SVM is reduced (by clustering) in order to adjust to a pre-determined, increasing function of cell depth; second, the decision boundaries are shifted to preserve all positive responses from the original set of training data. The limits on the numbers of clusters (virtual support vectors) result from minimizing the mean computational cost of collecting all detections subject to a bound on the expected number of false positives. When applied to detecting faces in cluttered scenes, the patterns correspond to poses and the free network is already faster and more accurate than applying a single pose-specific SVM many times. The graded network promotes very rapid processing of background regions while maintaining the discriminatory power of the free network. Keywords: statistical learning, hierarchy of classifiers, coarse-to-fine computation, support vector machines, face detection
4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery
Author: Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen
Abstract: In recent years, several methods have been proposed for the discovery of causal structure from non-experimental data. Such methods make various assumptions on the data generating process to facilitate its identification from purely observational data. Continuing this line of research, we show how to discover the complete causal structure of continuous-valued data, under the assumptions that (a) the data generating process is linear, (b) there are no unobserved confounders, and (c) disturbance variables have non-Gaussian distributions of non-zero variances. The solution relies on the use of the statistical method known as independent component analysis, and does not require any pre-specified time-ordering of the variables. We provide a complete Matlab package for performing this LiNGAM analysis (short for Linear Non-Gaussian Acyclic Model), and demonstrate the effectiveness of the method using artificially generated data and real-world data. Keywords: independent component analysis, non-Gaussianity, causal discovery, directed acyclic graph, non-experimental data
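The ICA-based recipe the abstract refers to can be illustrated in a few lines. This is a simplified sketch, not the authors' released Matlab package: it uses scikit-learn's FastICA, finds the row permutation of the unmixing matrix by brute force (so it only suits a handful of variables), and omits the second permutation step that prunes the estimated connection matrix to a strict causal order.

```python
import itertools
import numpy as np
from sklearn.decomposition import FastICA   # assumed available

def lingam_sketch(X):
    """X: (n_samples, n_vars); returns an estimated connection matrix B with x = B x + e."""
    n_vars = X.shape[1]
    W = FastICA(n_components=n_vars, random_state=0).fit(X).components_   # s ≈ W x
    # Permute rows of W so the diagonal is as far from zero as possible
    # (a simplification of the paper's permutation criterion).
    best = max(itertools.permutations(range(n_vars)),
               key=lambda p: np.sum(np.log(np.abs(W[list(p), range(n_vars)]) + 1e-12)))
    Wp = W[list(best)]
    Wp = Wp / np.diag(Wp)[:, None]            # normalize each row by its diagonal entry
    return np.eye(n_vars) - Wp                # B = I - W'

# Toy usage: x2 := 0.8 * x1 + uniform (non-Gaussian) noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 5000)
x2 = 0.8 * x1 + rng.uniform(-1, 1, 5000)
print(np.round(lingam_sketch(np.column_stack([x1, x2])), 2))   # entry [1, 0] should be near 0.8
```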
Author: Robert Castelo, Alberto Roverato
Abstract: Learning of large-scale networks of interactions from microarray data is an important and challenging problem in bioinformatics. A widely used approach is to assume that the available data constitute a random sample from a multivariate distribution belonging to a Gaussian graphical model. As a consequence, the prime objects of inference are full-order partial correlations, which are partial correlations between two variables given the remaining ones. In the context of microarray data the number of variables exceeds the sample size, and this precludes the application of traditional structure learning procedures because a sampling version of full-order partial correlations does not exist. In this paper we consider limited-order partial correlations, that is, partial correlations computed on marginal distributions of manageable size, and provide a set of rules that allow one to assess the usefulness of these quantities for deriving the independence structure of the underlying Gaussian graphical model. Furthermore, we introduce a novel structure learning procedure based on a quantity, obtained from limited-order partial correlations, that we call the non-rejection rate. The applicability and usefulness of the procedure are demonstrated on both simulated and real data. Keywords: Gaussian distribution, gene network, graphical model, microarray data, non-rejection rate, partial correlation, small-sample inference
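As a small illustration of the limited-order quantities the abstract builds on, the sketch below computes a first-order partial correlation from pairwise correlations; the paper's non-rejection rate, which aggregates many such tests over random variable subsets, is not reproduced here. Names are illustrative.

```python
import numpy as np

def first_order_partial_corr(x, y, z):
    # Partial correlation of x and y given a single conditioning variable z.
    r = np.corrcoef(np.vstack([x, y, z]))
    rxy, rxz, ryz = r[0, 1], r[0, 2], r[1, 2]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# x and y are correlated only through z, so the partial correlation is near zero.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + rng.normal(size=5000)
y = z + rng.normal(size=5000)
print(np.round(np.corrcoef(x, y)[0, 1], 2), np.round(first_order_partial_corr(x, y, z), 2))
```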
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
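A hedged sketch of the main ingredient described above: the empirical mutual information between a variable and a parent, turned into an independence test via the fact that 2N·MI is asymptotically chi-square under independence. The exact degrees-of-freedom bookkeeping and the decomposition over multiple parents used by the MIT score are not reproduced; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def mi_test(x, pa, alpha=0.99):
    """x, pa: integer-coded 1-D arrays; returns (2N*MI, chi-square threshold)."""
    N = len(x)
    joint = np.zeros((x.max() + 1, pa.max() + 1))
    np.add.at(joint, (x, pa), 1.0)                      # contingency table
    joint /= N
    px, ppa = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = np.nansum(joint * np.log(joint / (px * ppa)))
    dof = (joint.shape[0] - 1) * (joint.shape[1] - 1)
    return 2 * N * mi, chi2.ppf(alpha, dof)

# x genuinely depends on pa, so the statistic should far exceed the threshold.
rng = np.random.default_rng(0)
pa = rng.integers(0, 2, 5000)
x = (pa ^ (rng.random(5000) < 0.1)).astype(int)        # noisy copy of pa
print(mi_test(x, pa))
```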
Author: Shalabh Bhatnagar, Vivek S. Borkar, Madhukar Akarapu
Abstract: We study the problem of long-run average cost control of Markov chains conditioned on a rare event. In a related recent work, a simulation based algorithm for estimating performance measures associated with a Markov chain conditioned on a rare event has been developed. We extend ideas from this work and develop an adaptive algorithm for obtaining, online, optimal control policies conditioned on a rare event. Our algorithm uses three timescales or step-size schedules. On the slowest timescale, a gradient search algorithm for policy updates that is based on one-simulation simultaneous perturbation stochastic approximation (SPSA) type estimates is used. Deterministic perturbation sequences obtained from appropriate normalized Hadamard matrices are used here. The fast timescale recursions compute the conditional transition probabilities of an associated chain by obtaining solutions to the multiplicative Poisson equation (for a given policy estimate). Further, the risk parameter associated with the value function for a given policy estimate is updated on a timescale that lies in between the two scales above. We briefly sketch the convergence analysis of our algorithm and present a numerical application in the setting of routing multiple flows in communication networks. Keywords: Markov decision processes, optimal control conditioned on a rare event, simulation based algorithms, SPSA with deterministic perturbations, reinforcement learning
8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
Author: Enrique Castillo, Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, Amparo Alonso-Betanzos
Abstract: This paper introduces a learning method for two-layer feedforward neural networks based on sensitivity analysis, which uses a linear training algorithm for each of the two layers. First, random values are assigned to the outputs of the first layer; these initial values are then updated based on sensitivity formulas, which use the weights in each of the layers, and the process is repeated until convergence. Since these weights are learnt by solving a linear system of equations, there is an important saving in computational time. The method also gives the local sensitivities of the least squares errors with respect to the input and output data, with no extra computational cost, because the necessary information becomes available without extra calculations. This method, called the Sensitivity-Based Linear Learning Method, can also be used to provide an initial set of weights, which significantly improves the behavior of other learning algorithms. The theoretical basis for the method is given, and its performance is illustrated by applying it to several well-known data sets, on which it is compared with several other learning algorithms. The results show a learning speed that is generally faster than that of existing methods. In addition, the method can be used as an initialization tool for other well-known methods, yielding significant improvements. Keywords: supervised learning, neural networks, linear optimization, least-squares, initialization method, sensitivity analysis
9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
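A tiny numerical illustration of the convergence that these bounds quantify, under the usual 1/n scaling of the kernel matrix: the leading eigenvalues computed from growing samples of the same distribution stabilize, and the smaller eigenvalues are orders of magnitude below the largest one, which is why bounds that scale with the eigenvalue matter. The kernel and input distribution below are arbitrary choices.

```python
import numpy as np

def top_eigs(n, k=5, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 1))
    K = np.exp(-gamma * (X - X.T) ** 2)                     # RBF kernel matrix
    return np.sort(np.linalg.eigvalsh(K / n))[::-1][:k]     # eigenvalues of K/n

for n in (100, 400, 1600):
    print(n, np.round(top_eigs(n), 4))
```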
Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour
Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
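A minimal sketch of the action-elimination idea in the bandit setting: sample every surviving arm, maintain Hoeffding-style confidence intervals, and drop any arm whose upper confidence bound falls below the best lower bound. The constants and stopping rule below are illustrative rather than the paper's exact procedure, though the sample complexity has the same O((n/ε²) log(1/δ)) flavor.

```python
import numpy as np

def successive_elimination(pull, n_arms, eps=0.1, delta=0.05, max_rounds=100_000):
    """pull(i) -> reward in [0, 1] for arm i; returns the index of the surviving arm."""
    arms = list(range(n_arms))
    means, counts = np.zeros(n_arms), np.zeros(n_arms)
    for t in range(1, max_rounds + 1):
        for i in arms:                                    # one pull per surviving arm
            counts[i] += 1
            means[i] += (pull(i) - means[i]) / counts[i]
        radius = np.sqrt(np.log(4 * n_arms * t * t / delta) / (2 * t))   # Hoeffding-style radius
        best_lower = max(means[i] for i in arms) - radius
        arms = [i for i in arms if means[i] + radius >= best_lower]      # eliminate bad arms
        if len(arms) == 1 or 2 * radius < eps:
            break
    return max(arms, key=lambda i: means[i])

# Bernoulli arms with means 0.2, 0.5, 0.55, 0.8; arm 3 should typically survive.
rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.55, 0.8]
print(successive_elimination(lambda i: float(rng.random() < true_means[i]), 4))
```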
Author: Masashi Sugiyama
Abstract: The goal of active learning is to determine the locations of training input points so that the generalization error is minimized. We discuss the problem of active learning in linear regression scenarios. Traditional active learning methods using least-squares learning often assume that the model used for learning is correctly specified. In many practical situations, however, this assumption may not be fulfilled. Recently, active learning methods using “importance”-weighted least-squares learning have been proposed, which are shown to be robust against misspecification of models. In this paper, we propose a new active learning method also using the weighted least-squares learning, which we call ALICE (Active Learning using the Importance-weighted least-squares learning based on Conditional Expectation of the generalization error). An important difference from existing methods is that we predict the conditional expectation of the generalization error given training input points, while existing methods predict the full expectation of the generalization error. Due to this difference, the training input design can be fine-tuned depending on the realization of training input points. Theoretically, we prove that the proposed active learning criterion is a more accurate predictor of the single-trial generalization error than the existing criterion. Numerical studies with toy and benchmark data sets show that the proposed method compares favorably to existing methods. Keywords: Active Learning, Conditional Expectation of Generalization Error, Misspecification of Models, Importance-Weighted Least-Squares Learning, Covariate Shift.
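As a sketch of the importance-weighted least-squares ingredient referred to above, the snippet below reweights ordinary least squares by the density ratio p_test(x)/p_train(x) under covariate shift with a misspecified (linear) model. The densities are assumed known Gaussians here purely for illustration, and ALICE's active-learning criterion itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x_train = rng.normal(0.0, 1.0, n)                 # training input density: N(0, 1)
test_mean, test_std = 1.0, 0.7                    # test input density (assumed known)
y = np.sin(x_train) + 0.1 * rng.normal(size=n)    # true function is nonlinear -> linear model misspecified

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

w = gauss_pdf(x_train, test_mean, test_std) / gauss_pdf(x_train, 0.0, 1.0)   # importance weights
Phi = np.column_stack([np.ones(n), x_train])      # linear basis phi(x) = (1, x)

theta_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]
theta_iw = np.linalg.solve(Phi.T @ (w[:, None] * Phi), Phi.T @ (w * y))

x_test = rng.normal(test_mean, test_std, 10_000)
Phi_t = np.column_stack([np.ones_like(x_test), x_test])
for name, th in [("OLS", theta_ols), ("IWLS", theta_iw)]:
    print(name, "test MSE:", round(float(np.mean((Phi_t @ th - np.sin(x_test)) ** 2)), 4))
```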
12 jmlr-2006-Active Learning with Feedback on Features and Instances
Author: Hema Raghavan, Omid Madani, Rosie Jones
Abstract: We extend the traditional active learning framework to include feedback on features in addition to labeling instances, and we execute a careful study of the effects of feature selection and human feedback on features in the setting of text categorization. Our experiments on a variety of categorization tasks indicate that there is significant potential in improving classifier performance by feature re-weighting, beyond that achieved via membership queries alone (traditional active learning) if we have access to an oracle that can point to the important (most predictive) features. Our experiments on human subjects indicate that human feedback on feature relevance can identify a sufficient proportion of the most relevant features (over 50% in our experiments). We find that on average, labeling a feature takes much less time than labeling a document. We devise an algorithm that interleaves labeling features and documents which significantly accelerates standard active learning in our simulation experiments. Feature feedback can complement traditional active learning in applications such as news filtering, e-mail classification, and personalization, where the human teacher can have significant knowledge on the relevance of features. Keywords: active learning, feature selection, relevance feedback, term feedback, text classification
13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
Author: Fu Chang, Chin-Chin Lin, Chi-Jen Lu
Abstract: In this paper, we propose a number of adaptive prototype learning (APL) algorithms. They employ the same algorithmic scheme to determine the number and location of prototypes, but differ in the use of samples or the weighted averages of samples as prototypes, and also in the assumption of distance measures. To understand these algorithms from a theoretical viewpoint, we address their convergence properties, as well as their consistency under certain conditions. We also present a soft version of APL, in which a non-zero training error is allowed in order to enhance the generalization power of the resultant classifier. Applying the proposed algorithms to twelve UCI benchmark data sets, we demonstrate that they outperform many instance-based learning algorithms, the k-nearest neighbor rule, and support vector machines in terms of average test accuracy. Keywords: adaptive prototype learning, cluster-based prototypes, consistency, instance-based prototype, pattern classification
Author: Katya Scheinberg
Abstract: We propose an active set algorithm to solve the convex quadratic programming (QP) problem that is the core of support vector machine (SVM) training. The underlying method is not new and is based on the extensive practice of the Simplex method and its variants for convex quadratic problems. However, its application to large-scale SVM problems is new. Until recently the traditional active set methods were considered impractical for large SVM problems. By adapting the methods to the special structure of SVM problems we were able to produce an efficient implementation. We conduct an extensive study of the behavior of our method and its variations on SVM problems. We present computational results comparing our method with Joachims' SVMlight (see Joachims, 1999). The results show that our method has overall better performance on many SVM problems. It seems to have a particularly strong advantage on more difficult problems. In addition, this algorithm has better theoretical properties and it naturally extends to the incremental mode. Since the proposed method solves the standard SVM formulation, as does SVMlight, the generalization properties of these two approaches are identical and we do not discuss them in the paper. Keywords: active set methods, support vector machines, quadratic programming
Author: Radu Stefan Niculescu, Tom M. Mitchell, R. Bharat Rao
Abstract: The task of learning models for many real-world problems requires incorporating domain knowledge into learning algorithms, to enable accurate learning from a realistic volume of training data. This paper considers a variety of types of domain knowledge for constraining parameter estimates when learning Bayesian networks. In particular, we consider domain knowledge that constrains the values or relationships among subsets of parameters in a Bayesian network with known structure. We incorporate a wide variety of parameter constraints into learning procedures for Bayesian networks, by formulating this task as a constrained optimization problem. The assumptions made in module networks, dynamic Bayes nets and context specific independence models can be viewed as particular cases of such parameter constraints. We present closed form solutions or fast iterative algorithms for estimating parameters subject to several specific classes of parameter constraints, including equalities and inequalities among parameters, constraints on individual parameters, and constraints on sums and ratios of parameters, for discrete and continuous variables. Our methods cover learning from both frequentist and Bayesian points of view, from both complete and incomplete data. We present formal guarantees for our estimators, as well as methods for automatically learning useful parameter constraints from data. To validate our approach, we apply it to the domain of fMRI brain image analysis. Here we demonstrate the ability of our system to first learn useful relationships among parameters, and then to use them to constrain the training of the Bayesian network, resulting in improved cross-validated accuracy of the learned model. Experiments on synthetic data are also presented. Keywords: Bayesian networks, constrained optimization, domain knowledge
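A toy example of the flavor of closed-form constrained estimate the abstract mentions: the maximum-likelihood parameters of a single multinomial CPT column when some parameters are constrained to be equal, which simply pools the corresponding counts. This illustrates one easy constraint class under stated assumptions, not the paper's general machinery; names are illustrative.

```python
import numpy as np

def mle_with_equalities(counts, tied_groups):
    """counts: outcome counts for one CPT column; tied_groups: lists of indices
    whose parameters are constrained to be equal. Returns the constrained MLE."""
    counts = np.asarray(counts, dtype=float)
    theta = counts / counts.sum()                        # unconstrained MLE
    for group in tied_groups:
        theta[group] = counts[group].sum() / (len(group) * counts.sum())   # pooled estimate
    return theta

print(mle_with_equalities([30, 10, 20], tied_groups=[[1, 2]]))
# unconstrained: [0.50, 0.17, 0.33]; constrained: [0.50, 0.25, 0.25]
```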
16 jmlr-2006-Bounds for Linear Multi-Task Learning
Author: Andreas Maurer
Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning. Keywords: learning to learn, transfer learning, multi-task learning
Author: Magnus Ekdahl, Timo Koski
Abstract: In many pattern recognition/classification problems the true class conditional model and class probabilities are approximated for reasons of reducing complexity and/or of statistical estimation. The approximated classifier is expected to have worse performance, here measured by the probability of correct classification. We present an analysis valid in general, and easily computable formulas for estimating the degradation in probability of correct classification when compared to the optimal classifier. An example of such an approximation is the Naïve Bayes classifier. We show that the performance of the Naïve Bayes classifier depends on the degree of functional dependence between the features and labels. We also provide a sufficient condition for zero loss of performance. Keywords: Bayesian networks, naïve Bayes, plug-in classifier, Kolmogorov distance of variation, variational learning
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as O(n·dmax²), where n is the number of training examples; and (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
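A hedged sketch of the greedy flavor described above: basis centers are added one at a time from a random candidate pool, each time refitting on the selected kernel columns and keeping the candidate that most reduces the objective. To stay short, the sketch uses a regularized squared loss and a plain l2 penalty on the coefficients rather than the SVM primal, so it conveys the selection scheme, not the paper's exact method; names are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def greedy_basis_selection(X, y, d_max=10, lam=1e-2, n_candidates=30, seed=0):
    rng = np.random.default_rng(seed)
    chosen, beta = [], None
    for _ in range(d_max):
        cand = rng.choice(len(X), size=min(n_candidates, len(X)), replace=False)
        best = (np.inf, None, None)
        for j in cand:
            if j in chosen:
                continue
            idx = chosen + [int(j)]
            K = rbf(X, X[idx])                                   # n x |idx| kernel columns
            b = np.linalg.solve(K.T @ K + lam * np.eye(len(idx)), K.T @ y)
            obj = np.mean((K @ b - y) ** 2) + lam * b @ b
            if obj < best[0]:
                best = (obj, int(j), b)
        chosen.append(best[1])
        beta = best[2]
    return chosen, beta

# Toy regression: d_max basis functions approximate a sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
centers, beta = greedy_basis_selection(X, y)
pred = rbf(X, X[centers]) @ beta
print("centers:", len(centers), "train MSE:", round(float(np.mean((pred - y) ** 2)), 4))
```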
19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs
Author: Anders Jonsson, Andrew Barto
Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction
20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Q-learning, belief propagation
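A tiny illustration of the payoff-propagation (max-plus) step on a coordination graph: three agents on a chain with two edge payoff functions. On a tree the propagated messages recover the globally optimal joint action, which the sketch verifies against brute force; the payoff tables are random and purely illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
f01 = rng.normal(size=(n_actions, n_actions))    # edge payoff between agents 0 and 1
f12 = rng.normal(size=(n_actions, n_actions))    # edge payoff between agents 1 and 2

# One forward and one backward sweep of max-plus messages along the chain 0 - 1 - 2.
m_0to1 = f01.max(axis=0)                          # best response of agent 0, per a1
m_2to1 = f12.max(axis=1)                          # best response of agent 2, per a1
m_1to0 = (f01 + m_2to1[None, :]).max(axis=1)      # max-marginal for a0
m_1to2 = (f12 + m_0to1[:, None]).max(axis=0)      # max-marginal for a2

# Each agent maximizes its incoming messages (ties are measure-zero with random payoffs).
a0 = int(np.argmax(m_1to0))
a1 = int(np.argmax(m_0to1 + m_2to1))
a2 = int(np.argmax(m_1to2))

best = max(itertools.product(range(n_actions), repeat=3),
           key=lambda a: f01[a[0], a[1]] + f12[a[1], a[2]])
print((a0, a1, a2), tuple(best))                  # the two joint actions coincide
```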
22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
25 jmlr-2006-Distance Patterns in Structural Similarity
28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning
32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
44 jmlr-2006-Large Scale Transductive SVMs
45 jmlr-2006-Learning Coordinate Covariances via Gradients
46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
47 jmlr-2006-Learning Image Components for Object Recognition
48 jmlr-2006-Learning Minimum Volume Sets
49 jmlr-2006-Learning Parts-Based Representations of Data
52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
53 jmlr-2006-Learning a Hidden Hypergraph
54 jmlr-2006-Learning the Structure of Linear Latent Variable Models
57 jmlr-2006-Linear State-Space Models for Blind Source Separation
58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
65 jmlr-2006-Nonparametric Quantile Estimation
66 jmlr-2006-On Model Selection Consistency of Lasso
67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations
68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies
69 jmlr-2006-One-Class Novelty Detection for Seizure Analysis from Intracranial EEG
70 jmlr-2006-Online Passive-Aggressive Algorithms
73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs
75 jmlr-2006-Policy Gradient in Continuous Time
76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
77 jmlr-2006-Quantile Regression Forests
78 jmlr-2006-Rearrangement Clustering: Pitfalls, Remedies, and Applications
80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
81 jmlr-2006-Some Discriminant-Based PAC Algorithms
82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
88 jmlr-2006-Streamwise Feature Selection
92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
93 jmlr-2006-Universal Kernels
94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
97 jmlr-2006- (Special Topic on Machine Learning for Computer Security)